clay codes: moulding mds codes to yield an msr codemyna/seminar/pdfs/myna_5_12_18.pdf ·...

139
Clay Codes: Moulding MDS Codes to Yield an MSR Code Myna Vajha ECE Student Seminar Series December 5, 2018 1 / 38

Upload: others

Post on 12-Apr-2020

3 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Clay Codes: Moulding MDS Codes to Yield an MSR Code

Myna Vajha

ECE Student Seminar Series

December 5, 2018

1 / 38

Page 2: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Team

This is a joint work with:

1 Indian Institute of Science

I Vinayak Ramkumar, Bhagyashree Puranik, Ganesh Kini, Elita Lobo, Birenjith Sasidharan, Vijay

Kumar P

2 University of Maryland

I Alexandar Barg, Min Ye

3 NetApp ATG Bangalore

I Srinivasan Narayanamurthy, Syed Hussain, Siddhartha Nandi

2 / 38

Page 3: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Data Centers and Erasure Codes

3 / 38

Page 4: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Fault Tolerance

Fault tolerance is key tomaking data loss a veryremote possibility

A time-honored means ofachieving fault tolerance isreplication..

File or Object

Split it into blocks

A B C D

A A A

Triple replication

Stored in different nodes of the storage network

Figure: Tripe Replication Code used in Google File System

4 / 38

Page 5: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Fault Tolerance

Fault tolerance is key tomaking data loss a veryremote possibility

A time-honored means ofachieving fault tolerance isreplication..

File or Object

Split it into blocks

A B C D

A A A

Triple replication

Stored in different nodes of the storage network

Figure: Tripe Replication Code used in Google File System

4 / 38

Page 6: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Drawback of Triple Replication

But triple replication is poor in terms of storage overhead: 3x. Are there better ways ?

A well-known alternative is to use Erasure Coding (EC)

5 / 38

Page 7: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Drawback of Triple Replication

But triple replication is poor in terms of storage overhead: 3x. Are there better ways ?

A well-known alternative is to use Erasure Coding (EC)

5 / 38

Page 8: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Erasure Coding for Fault Tolerance

File or Object

Split it into chunks

Ak

Store the n chunks in different nodes of the storage network

A2A1

(n,k) erasure coden=k+m

P1 P2 Pm

k data chunks m parity chunks

The n chunks taken together, form a stripe.

Two Key Performance Measures

1 Storage Overhead nk

2 Fault Tolerance - at most m storage units

MDS Codes

1 For given (n, k), MDS erasure codes have the

maximum-possible fault tolerance

I Can tolerate m = n − k failures.

2 RAID 6 and Reed-Solomon(RS) codes are examples of MDScodes.

3 HDFS EC, Ceph have implementations of RS codes.

6 / 38

Page 9: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Erasure Coding for Fault Tolerance

File or Object

Split it into chunks

Ak

Store the n chunks in different nodes of the storage network

A2A1

(n,k) erasure coden=k+m

P1 P2 Pm

k data chunks m parity chunks

The n chunks taken together, form a stripe.

Two Key Performance Measures

1 Storage Overhead nk

2 Fault Tolerance - at most m storage units

MDS Codes

1 For given (n, k), MDS erasure codes have the

maximum-possible fault tolerance

I Can tolerate m = n − k failures.

2 RAID 6 and Reed-Solomon(RS) codes are examples of MDScodes.

3 HDFS EC, Ceph have implementations of RS codes.

6 / 38

Page 10: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Erasure Coding for Fault Tolerance

File or Object

Split it into chunks

Ak

Store the n chunks in different nodes of the storage network

A2A1

(n,k) erasure coden=k+m

P1 P2 Pm

k data chunks m parity chunks

The n chunks taken together, form a stripe.

Two Key Performance Measures

1 Storage Overhead nk

2 Fault Tolerance - at most m storage units

MDS Codes

1 For given (n, k), MDS erasure codes have the

maximum-possible fault tolerance

I Can tolerate m = n − k failures.

2 RAID 6 and Reed-Solomon(RS) codes are examples of MDScodes.

3 HDFS EC, Ceph have implementations of RS codes.

6 / 38

Page 11: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

An Example MDS Code - The RAID 6 Code

Source: https://upload.wikimedia.org/wikipedia/commons/thumb/7/70/RAID_6.svg/1280px-RAID_6.svg.png

7 / 38

Page 12: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Other RS Codes in Practice

Most Popular in Practice: Reed-Solomon Codes

8

Intel & Cloudera (2016) “Progress Report: Bringing Erasure Coding to Apache Hadoop”

Storage Systems Reed-Solomon codesLinux RAID-6 RS(10,8)Google File System II (Colossus) RS(9,6)Quantcast File System RS(9,6)Intel & Cloudera’ HDFS-EC RS(9,6)Yahoo Cloud Object Store RS(11,8)Backblaze’s online backup RS(20,17)Facebook’s f4 BLOB storage system RS(14,10)Baidu’s Atlas Cloud Storage RS(12, 8)

H. Dau et al, “Repairing Reed-Solomon Codes with Single and Multiple Erasures,” ITA, 2017, San Diego.

8 / 38

Page 13: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Evolution of HDFS to Incorporate EC ⇒ HDFS-EC

1 Typically, EC reduces the storage cost by 50% compared with 3x replication

2 Motivated by this, Cloudera and Intel initiated the HDFS-EC project

3 Available in Hadoop 3.0.

4 Employs a striped layout:

5 Possibility of incorporating more sophisticated EC schemes !

Zhe Zhang, Andrew Wang, Kai Zheng, Uma Maheswara G., and Vinayakumar, “Introduction to HDFS Erasure Coding in

Apache Hadoop,” September 23, 2015.9 / 38

Page 14: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Erasure Codes and Node Failures

A median of 50 nodes are unavailable per day.

98% of the failures are single node failures.

A median of 180TB of network traffic per day isgenerated in order to reconstruct the RS codeddata corresponding to unavailable machines.

Thus there is a strong need for erasure codesthat can efficiently recover from single-nodefailures.

Image courtesy: Rashmi et al.: “A Solution to the Network Challenges of Data Recovery in Erasure-coded Distributed Storage Systems: A Study on the Facebook

Warehouse Cluster,” USENIX Hotstorage, 2013.

10 / 38

Page 15: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Erasure Codes and Node Failures

A median of 50 nodes are unavailable per day.

98% of the failures are single node failures.

A median of 180TB of network traffic per day isgenerated in order to reconstruct the RS codeddata corresponding to unavailable machines.

Thus there is a strong need for erasure codesthat can efficiently recover from single-nodefailures.

Image courtesy: Rashmi et al.: “A Solution to the Network Challenges of Data Recovery in Erasure-coded Distributed Storage Systems: A Study on the Facebook

Warehouse Cluster,” USENIX Hotstorage, 2013.10 / 38

Page 16: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Conventional Node Repair of an RS Code

The conventional repair of an RS code is inefficient

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

10 X 100MB

1 2 3 4 5 6 7 8 9 10 11 12 13 14

Data Chunk Parity Chunk Erased Chunk

In the example (14, 10) RS code,

1 the amount of data downloaded to repair 100MB of data equals 1GB.

clearly, there is room for improvement...

11 / 38

Page 17: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Conventional Node Repair of an RS Code

The conventional repair of an RS code is inefficient

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

10 X 100MB

1 2 3 4 5 6 7 8 9 10 11 12 13 14

Data Chunk Parity Chunk Erased Chunk

In the example (14, 10) RS code,

1 the amount of data downloaded to repair 100MB of data equals 1GB.

clearly, there is room for improvement...

11 / 38

Page 18: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Conventional Node Repair of an RS Code

The conventional repair of an RS code is inefficient

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

10 X 100MB

1 2 3 4 5 6 7 8 9 10 11 12 13 14

Data Chunk Parity Chunk Erased Chunk

In the example (14, 10) RS code,

1 the amount of data downloaded to repair 100MB of data equals 1GB.

clearly, there is room for improvement...

11 / 38

Page 19: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Conventional Node Repair of an RS Code

The conventional repair of an RS code is inefficient

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

10 X 100MB

1 2 3 4 5 6 7 8 9 10 11 12 13 14

Data Chunk Parity Chunk Erased Chunk

In the example (14, 10) RS code,

1 the amount of data downloaded to repair 100MB of data equals 1GB.

clearly, there is room for improvement...

11 / 38

Page 20: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Coding Theory Responds

Regenera'ng(Codes(

Codes(with((Locality(

•  Regenera'ng(codes(reduce(repair(bandwidth(•  Codes(with(locality(reduce(repair(degree(

1 Regenerating codes

I minimize the amount of data download

(repair bandwidth) needed for node repair

2 Locally recoverable codes

I minimize the number of helper nodescontacted for node repair, but also reducerepair bandwidth

I Not MDS anymore

3 Novel and efficient approaches to RS repair amore recent development

A. G. Dimakis, P. B. Godfrey, Y. Wu, M. Wainwright, and K. Ramchandran, “Network Coding for Distributed Storage Systems,” IEEE Trans. Inform.Th., Sep. 2010.

P. Gopalan, C. Huang, H. Simitci, and S. Yekhanin, “On the Locality of Codeword Symbols,” IEEE Trans. Inf. Theory, Nov. 2012.

V. Guruswami, M. Wootters, “Repairing Reed-Solomon Codes,” arXiv:1509.04764 [cs.IT] .

12 / 38

Page 21: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Coding Theory Responds

Regenera'ng(Codes(

Codes(with((Locality(

•  Regenera'ng(codes(reduce(repair(bandwidth(•  Codes(with(locality(reduce(repair(degree(

1 Regenerating codes

I minimize the amount of data download

(repair bandwidth) needed for node repair

2 Locally recoverable codes

I minimize the number of helper nodescontacted for node repair, but also reducerepair bandwidth

I Not MDS anymore

3 Novel and efficient approaches to RS repair amore recent development

A. G. Dimakis, P. B. Godfrey, Y. Wu, M. Wainwright, and K. Ramchandran, “Network Coding for Distributed Storage Systems,” IEEE Trans. Inform.Th., Sep. 2010.

P. Gopalan, C. Huang, H. Simitci, and S. Yekhanin, “On the Locality of Codeword Symbols,” IEEE Trans. Inf. Theory, Nov. 2012.

V. Guruswami, M. Wootters, “Repairing Reed-Solomon Codes,” arXiv:1509.04764 [cs.IT] .

12 / 38

Page 22: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Coding Theory Responds

Regenera'ng(Codes(

Codes(with((Locality(

•  Regenera'ng(codes(reduce(repair(bandwidth(•  Codes(with(locality(reduce(repair(degree(

1 Regenerating codes

I minimize the amount of data download

(repair bandwidth) needed for node repair

2 Locally recoverable codes

I minimize the number of helper nodescontacted for node repair, but also reducerepair bandwidth

I Not MDS anymore

3 Novel and efficient approaches to RS repair amore recent development

A. G. Dimakis, P. B. Godfrey, Y. Wu, M. Wainwright, and K. Ramchandran, “Network Coding for Distributed Storage Systems,” IEEE Trans. Inform.Th., Sep. 2010.

P. Gopalan, C. Huang, H. Simitci, and S. Yekhanin, “On the Locality of Codeword Symbols,” IEEE Trans. Inf. Theory, Nov. 2012.

V. Guruswami, M. Wootters, “Repairing Reed-Solomon Codes,” arXiv:1509.04764 [cs.IT] .

12 / 38

Page 23: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Regenerating Codes1 We will deal here only in the subclass of regenerating codes known as Minimum Storage Regeneration

(MSR) codes

2 MSR codes are MDS and have least possible repair bandwidth

3 Repair bandwidth is defined as the total amount of data downloaded for repair of a single node

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

13 X 25MB

1 2 3 4 5 6 7 8 9 10 11 12 13 14

Data Chunk Parity Chunk Erased Chunk

1 Size of failed node’s contents: 100MB

2 RS repair BW: 1 GB

3 MSR Repair BW: 325 MB

13 / 38

Page 24: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Regenerating Codes1 We will deal here only in the subclass of regenerating codes known as Minimum Storage Regeneration

(MSR) codes

2 MSR codes are MDS and have least possible repair bandwidth

3 Repair bandwidth is defined as the total amount of data downloaded for repair of a single node

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

100MB

13 X 25MB

1 2 3 4 5 6 7 8 9 10 11 12 13 14

Data Chunk Parity Chunk Erased Chunk

1 Size of failed node’s contents: 100MB

2 RS repair BW: 1 GB

3 MSR Repair BW: 325 MB

13 / 38

Page 25: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Key to the Impressive, Low-Repair BW of MSR Codes

In a nutshell: sub-packetization... we explain...

14 / 38

Page 26: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Key to the Impressive, Low-Repair BW of MSR Codes

In a nutshell: sub-packetization... we explain...

14 / 38

Page 27: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

n = k+m

Chunk

k data chunks m parity chunks

Page 28: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

n = k+m

Chunk

k data chunks m parity chunks

k

Page 29: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

n = k+m

Chunk

k data chunks m parity chunks

k

sub-chunkα

sub-packetization level

Page 30: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

n = k+m

Chunk

k data chunks m parity chunks

k

sub-chunkα

sub-packetization level < α

dk<d<n

Page 31: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

n = k+m

Chunk

k data chunks m parity chunks

k

sub-chunkα

sub-packetization level < α

dk<d<n

kα(1GB)

d<< kα

(325MB)

Page 32: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

n = k+m

Chunk

k data chunks m parity chunks

k

sub-chunkα

sub-packetization level < α

dk<d<n

kα(1GB)

d<< kα

(325MB)

= α/(d-k+1) is a fraction of α

Repair BW = dWe consider d=n-1, then Repair BW = (n-1)α/(n-k)

Page 33: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

n = k+m

Chunk

k data chunks m parity chunks

k

sub-chunkα

sub-packetization level < α

dk<d<n

kα(1GB)

d<< kα

(325MB)

Larger the m=n-k, larger the savings!!

= α/(d-k+1) is a fraction of α

Repair BW = dWe consider d=n-1, then Repair BW = (n-1)α/(n-k)

Page 34: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Additional Properties Desired of an MSR Code

1 Minimal Disk Read (IO Optimality): Read exactly what is needed to be transferred

2 Minimize sub-packetization level α

I sub-chunk size = chunk sizeα

= N bytes.I During repair, β sub-chunks are read.I If sub-chunks are not contiguous, only N bytes are read sequentially.I Smaller the α better the sequentiality!!

3 Small field size, low-complexity implementation.

15 / 38

Page 35: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Additional Properties Desired of an MSR Code

1 Minimal Disk Read (IO Optimality): Read exactly what is needed to be transferred

2 Minimize sub-packetization level α

I sub-chunk size = chunk sizeα

= N bytes.I During repair, β sub-chunks are read.I If sub-chunks are not contiguous, only N bytes are read sequentially.I Smaller the α better the sequentiality!!

3 Small field size, low-complexity implementation.

15 / 38

Page 36: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Additional Properties Desired of an MSR Code

1 Minimal Disk Read (IO Optimality): Read exactly what is needed to be transferred

2 Minimize sub-packetization level α

I sub-chunk size = chunk sizeα

= N bytes.I During repair, β sub-chunks are read.

I If sub-chunks are not contiguous, only N bytes are read sequentially.I Smaller the α better the sequentiality!!

3 Small field size, low-complexity implementation.

15 / 38

Page 37: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Additional Properties Desired of an MSR Code

1 Minimal Disk Read (IO Optimality): Read exactly what is needed to be transferred

2 Minimize sub-packetization level α

I sub-chunk size = chunk sizeα

= N bytes.I During repair, β sub-chunks are read.I If sub-chunks are not contiguous, only N bytes are read sequentially.I Smaller the α better the sequentiality!!

3 Small field size, low-complexity implementation.

15 / 38

Page 38: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Additional Properties Desired of an MSR Code

1 Minimal Disk Read (IO Optimality): Read exactly what is needed to be transferred

2 Minimize sub-packetization level α

I sub-chunk size = chunk sizeα

= N bytes.I During repair, β sub-chunks are read.I If sub-chunks are not contiguous, only N bytes are read sequentially.I Smaller the α better the sequentiality!!

3 Small field size, low-complexity implementation.

15 / 38

Page 39: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

4-way Optimality of Clay code

Least possible storage overhead(MDS Codes)

Least possible repair bandwidth(MSR Codes)

Least possible disk read(Optimal access MSR Codes)

Least possible sub-packetization(Clay Codes)

among the class of MSR codes, the Clay code is arguably a champion...

Image courtesy: denverpost.com

16 / 38

Page 40: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

4-way Optimality of Clay code

Least possible storage overhead(MDS Codes)

Least possible repair bandwidth(MSR Codes)

Least possible disk read(Optimal access MSR Codes)

Least possible sub-packetization(Clay Codes)

among the class of MSR codes, the Clay code is arguably a champion...

Image courtesy: denverpost.com

16 / 38

Page 41: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Placing the Clay Code in Perspective

Comparing the Clay code with repair-efficient codes that have undergone systems implementation

Code MDSLeast

Repair BW

Least Disk Read

Least α Restrictions

Implemented Distributed

Systems

Piggybacked RS(Sigcomm 2014)

✔ ✗ ✗ - None HDFS

Product Matrix(FAST 2015)

✔ ✔ ✔ ✔ Limited to Storage

Overhead > 2

Own System

Butterfly Code(FAST 2016)

✔ ✔ ✗ ✗ Limited to the 2 parity nodes

HDFS, Ceph

HashTag Code(Trans. on Big Data

2017)

✔ ✗ ✗ - Only systematic node

repair

HDFS

Clay(FAST 2018)

✔ ✔ ✔ ✔ None! Ceph

The Butterfly, HashTag codes have least disk read for systematic node repair.

17 / 38

Page 42: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Clay Code Construction

18 / 38

Page 43: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Moulding an MDS Code to Yield a (4, 2) Clay Code

(0,0) (0,1)(1,0) (1,1)

ParityData

Two sub-chunks are encoded using (4, 2)scalar MDS code.

→z=0

xy

z=1

z=2

z=3

Layer four such units.

z= (0,0)

z= (1,1)

z= (1,0)

z= (0,1)

Index each layer z using two bits(corresponding to the location of the two

red dots in that layer).

U*U

sub-chunks such as (U,U∗) are paired(yellow rectangles connected by a dotted

line).

Pairwise Forward Transform (PFT)

CC*

U

U*= A

Any two sub-chunks out of{U,U∗,C ,C∗} can be computed

from remaining two.

CC*

Perform PFT on paired sub-chunks andcopy the unpaired sub-chunks to get the

Clay code.

Can be generalized to any (n, k , d)!!

19 / 38

Page 44: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Moulding an MDS Code to Yield a (4, 2) Clay Code

(0,0) (0,1)(1,0) (1,1)

ParityData

Two sub-chunks are encoded using (4, 2)scalar MDS code.

→z=0

xy

z=1

z=2

z=3

Layer four such units.

z= (0,0)

z= (1,1)

z= (1,0)

z= (0,1)

Index each layer z using two bits(corresponding to the location of the two

red dots in that layer).

U*U

sub-chunks such as (U,U∗) are paired(yellow rectangles connected by a dotted

line).

Pairwise Forward Transform (PFT)

CC*

U

U*= A

Any two sub-chunks out of{U,U∗,C ,C∗} can be computed

from remaining two.

CC*

Perform PFT on paired sub-chunks andcopy the unpaired sub-chunks to get the

Clay code.

Can be generalized to any (n, k , d)!!

19 / 38

Page 45: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Moulding an MDS Code to Yield a (4, 2) Clay Code

(0,0) (0,1)(1,0) (1,1)

ParityData

Two sub-chunks are encoded using (4, 2)scalar MDS code.

→z=0

xy

z=1

z=2

z=3

Layer four such units.

z= (0,0)

z= (1,1)

z= (1,0)

z= (0,1)

Index each layer z using two bits(corresponding to the location of the two

red dots in that layer).

U*U

sub-chunks such as (U,U∗) are paired(yellow rectangles connected by a dotted

line).

Pairwise Forward Transform (PFT)

CC*

U

U*= A

Any two sub-chunks out of{U,U∗,C ,C∗} can be computed

from remaining two.

CC*

Perform PFT on paired sub-chunks andcopy the unpaired sub-chunks to get the

Clay code.

Can be generalized to any (n, k , d)!!

19 / 38

Page 46: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Moulding an MDS Code to Yield a (4, 2) Clay Code

(0,0) (0,1)(1,0) (1,1)

ParityData

Two sub-chunks are encoded using (4, 2)scalar MDS code.

→z=0

xy

z=1

z=2

z=3

Layer four such units.

z= (0,0)

z= (1,1)

z= (1,0)

z= (0,1)

Index each layer z using two bits(corresponding to the location of the two

red dots in that layer).

U*U

sub-chunks such as (U,U∗) are paired(yellow rectangles connected by a dotted

line).

Pairwise Forward Transform (PFT)

CC*

U

U*= A

Any two sub-chunks out of{U,U∗,C ,C∗} can be computed

from remaining two.

CC*

Perform PFT on paired sub-chunks andcopy the unpaired sub-chunks to get the

Clay code.

Can be generalized to any (n, k , d)!!

19 / 38

Page 47: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Moulding an MDS Code to Yield a (4, 2) Clay Code

(0,0) (0,1)(1,0) (1,1)

ParityData

Two sub-chunks are encoded using (4, 2)scalar MDS code.

→z=0

xy

z=1

z=2

z=3

Layer four such units.

z= (0,0)

z= (1,1)

z= (1,0)

z= (0,1)

Index each layer z using two bits(corresponding to the location of the two

red dots in that layer).

U*U

sub-chunks such as (U,U∗) are paired(yellow rectangles connected by a dotted

line).

Pairwise Forward Transform (PFT)

CC*

U

U*= A

Any two sub-chunks out of{U,U∗,C ,C∗} can be computed

from remaining two.

CC*

Perform PFT on paired sub-chunks andcopy the unpaired sub-chunks to get the

Clay code.

Can be generalized to any (n, k , d)!!

19 / 38

Page 48: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Moulding an MDS Code to Yield a (4, 2) Clay Code

(0,0) (0,1)(1,0) (1,1)

ParityData

Two sub-chunks are encoded using (4, 2)scalar MDS code.

→z=0

xy

z=1

z=2

z=3

Layer four such units.

z= (0,0)

z= (1,1)

z= (1,0)

z= (0,1)

Index each layer z using two bits(corresponding to the location of the two

red dots in that layer).

U*U

sub-chunks such as (U,U∗) are paired(yellow rectangles connected by a dotted

line).

Pairwise Forward Transform (PFT)

CC*

U

U*= A

Any two sub-chunks out of{U,U∗,C ,C∗} can be computed

from remaining two.

CC*

Perform PFT on paired sub-chunks andcopy the unpaired sub-chunks to get the

Clay code.

Can be generalized to any (n, k , d)!!

19 / 38

Page 49: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Moulding an MDS Code to Yield a (4, 2) Clay Code

(0,0) (0,1)(1,0) (1,1)

ParityData

Two sub-chunks are encoded using (4, 2)scalar MDS code.

→z=0

xy

z=1

z=2

z=3

Layer four such units.

z= (0,0)

z= (1,1)

z= (1,0)

z= (0,1)

Index each layer z using two bits(corresponding to the location of the two

red dots in that layer).

U*U

sub-chunks such as (U,U∗) are paired(yellow rectangles connected by a dotted

line).

Pairwise Forward Transform (PFT)

CC*

U

U*= A

Any two sub-chunks out of{U,U∗,C ,C∗} can be computed

from remaining two.

CC*

Perform PFT on paired sub-chunks andcopy the unpaired sub-chunks to get the

Clay code.

Can be generalized to any (n, k , d)!!19 / 38

Page 50: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Generic Construction: 3D Representation of a Codeword

(n = st, k, d), (α = s t , β = s t−1, Fq), s = d − k + 1

s = 4, t = 5

y=0 1 2 3 4x=0123

z = (3, 2, 3, 1, 0)

There are nα = s × t × s t code symbols in Fq.

They can be indexed by 3-tuple (x , y ; z) where x ∈ Zs , y ∈ Zt , z ∈ Zts .

(x , y) tuple indicates node, z index the symbols index within α symbols.

Pair symbol to C(x , y , z) is obtained by simply replacing x with zy

C∗(x , y , z) = C(zy , y , z(y , x))

where zy 6= x , z(y , x) = (z0, · · · , zy−1, x , zy+1, · · · , zt−1)

PFT is performed to get U(x , y , z) where[U(x , y , z)U∗(x , y , z)

]=

[1 γγ 1

] [C(x , y , z)C∗(x , y , z)

]For every z ∈ Zt

s , the collection of symbols {U(x , y , z) | x ∈ Zs , y ∈ Zt}20 / 38

Page 51: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Encoding the Clay Code

The previous slide did not explain how encoding takes place as the code was not insystematic form.

We will now explain encoding data under the Clay Code.

21 / 38

Page 52: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

64MB

Consider a file of size 64MB

● We show encoding of the file using (n = 4, k = 2) Clay code.

Page 53: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Break the file into k = 2 data chunks each of 32MB.32MB 32MB

Page 54: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

3D cube representation of Clay Code32MB 32MB

z = (0,0)

z = (1,1)

z

yx The cube has:

● 4 columns, which correspond to the 4 chunks (each of size 32MB, stored in a different disk/node).

● 4 horizontal planes.

● Each column has 4 points that correspond to sub-chunks of size 8MB

Page 55: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Place two 32MB chunks in two data nodes32MB

z = (0,0)

z

yx

z = (1,1)

Page 56: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Place two 32MB chunks in two data nodes

z = (0,0)

z = (1,1)

z

yx

Page 57: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

We now have the data nodes

Page 58: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

We will now compute the parity nodes

Page 59: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Will get there through an intermediate “Uncoupled data cube”

Page 60: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Start filling the Uncoupled data cube on the right as follows

Page 61: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

CC*

Certain pairs of points in the cube are “coupled”

Page 62: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

CC*

U U*

PRT

C C*

PRT is a 2x2 matrix transform, It is reverse of PFT

Page 63: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

CC* U U*

Place the sub-chunks obtained in the uncoupled data cube

Page 64: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

CC*

UU*

Place the sub-chunks obtained in the uncoupled data cube

Page 65: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

CC*

Place the sub-chunks obtained in the uncoupled data cube

Page 66: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

CC*

U U*

PRT

C C*

Place the sub-chunks obtained in the uncoupled data cube

Page 67: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

CC*

U U*

Place the sub-chunks obtained in the uncoupled data cube

Page 68: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

CC*

UU*

Place the sub-chunks obtained in the uncoupled data cube

Page 69: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Place the sub-chunks obtained in the uncoupled data cube

Page 70: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Copy

Red dotted sub-chunks are not paired, they are simply carried over

Page 71: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Red dotted sub-chunks are not paired, they are simply carried over

Copy

Page 72: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

We now have data-part of the uncoupled data cube

Page 73: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

z = (0,0)

Each plane is Reed-Solomon encoded to obtain parity points (sub-chunks)

Page 74: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

z = (0,0)

RS Encode

(4,2)

Each plane is Reed-Solomon encoded to obtain parity points (sub-chunks)

Page 75: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

z = (0,0)

RS Encode

(4,2)

Each plane is Reed-Solomon encoded to obtain parity points (sub-chunks)

Page 76: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

z = (0,0)

Each plane is Reed-Solomon encoded to obtain parity points (sub-chunks)

Page 77: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

z = (1,0)

RS Encode

(4,2)

Each plane is Reed-Solomon encoded to obtain parity points (sub-chunks)

Page 78: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

z = (0,1)

RS Encode

(4,2)

Each plane is Reed-Solomon encoded to obtain parity points (sub-chunks)

Page 79: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

z = (1,1)

RS Encode

(4,2)

Each plane is Reed-Solomon encoded to obtain parity points (sub-chunks)

Page 80: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Now we have the complete Uncoupled data cube

Page 81: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Parity sub-chunks of Coupled data cube can now be computed

Page 82: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

U

U*

Perform PFT

Page 83: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

C C*

PFT

U U* U

U*

Perform PFT

Page 84: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

C C*

Perform PFT

Page 85: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Perform PFT

Page 86: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

C C*

PFT

U U*

U

U*

Perform PFT

Page 87: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

C C*

Perform PFT

Page 88: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Perform PFT

Page 89: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Copy

Red dotted sub-chunks are simply carried over

Page 90: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Red dotted sub-chunks are simply carried over

Copy

Page 91: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

The encoding is now complete!

Page 92: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Recovery from single node failure

22 / 38

Page 93: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Node Repair: One node fails

Page 94: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Only half of planes participate in repair

● Total Helper Data = 8MB X 3 X 2 = 48MB

● As opposed to RS code = 8MB X 2 X 4 = 64MB

● Much larger savings seen for m > 2

Page 95: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Perform PRT to get possible uncoupled sub-chunks

PRT

Page 96: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Run RS decoding on each of the selected planes

PRT

RS Decode

Page 97: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Run RS decoding on each of the selected planes

PRT

RS Decode

Page 98: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Run RS decoding on each of the selected planes

PRT

RS Decode

Page 99: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

PRT

Run RS decoding on each of the selected planes

RS Decode

Page 100: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

We now have the following sub-chunks available

Page 101: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Half the number of required sub-chunks are now already computed

Copy

Page 102: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Compute C* from C and U

C,UC*

Page 103: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Compute C* from C and U

C,UC*

Page 104: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Compute C* from C and U

C,UC*

Page 105: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Compute C* from C and U

C,UC*

Page 106: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Replacement node

Content of failed node is now completely recovered

Page 107: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

MDS Property of Clay Code

Any n − k node failures can be recovered from.

The decoding algorithm recovers the lost symbols layer by layer sequentially.

It uses functions scalar MDS decode, PFT, PRT and the function that computes U from {U∗,C}.

Decoding algorithm involves α scalar MDS decode operations along with 2nβ Galois field scalarmultiplications and nβ Galois XOR operations.

RS decode for the same amount of data involves α scalar MDS decode operations.

23 / 38

Page 108: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Implementation and Evaluation of Clay Code

24 / 38

Page 109: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Ceph: ArchitectureObject Storage Daemon (OSD): process of Ceph, associated with a storage unit.

Pool: Logical partitions, associated with an erasure-code profile.

Placement Group(PG): Collection of n OSDs.

Each pool can have a single or multiple PGs associated with it.

PG1

OSD7 OSD1OSD4

OSD5OSD2

POOL

OSD3

p-OSD

PG2

OSD4 OSD3OSD5

OSD1OSD7 OSD6

p-OSD

Erasure Code Profile

OBJECT

OBJECT OBJECT

25 / 38

Page 110: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Recovery Code Flow in Ceph

26 / 38

Page 111: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Ceph: Contributions

We introduced the notion of sub-chunking to enable use of vector erasure codes withCeph.

Clay code is available as an erasure code plugin in Ceph for all parameters (n, k , d)

Both these features are currently available in master codebase of ceph.

Clay code will be available in Ceph’s next release nautilus.

27 / 38

Page 112: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Evaluation of the Clay Code

Evaluated on a 26 node (m4.xlarge) AWS cluster.

One node hosts Monitor (MON) process of Ceph.

Remaining 25 nodes host one OSD each.

Each node has 500GB SSD type volume attached.

Two workloads

I Workload W1: fixed size 64MB objects → stripe size 64MBI Workload W2: mixture of 1MB, 32MB, and 64MB size objects, → stripe size 1MB

Both single PG and multiple PG (512 PG) experiments.

Codes evaluated: (6, 4, 5), (12, 9, 11) and (20, 16, 19).

28 / 38

Page 113: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Network Traffic and Disk Read : W1 Workload, 1 PG

Network traffic reduced to 75%, 48%, 34% ofthat of RS as predicted by theory.

Repair disk read reduced to 62%, 41%, 29% ofthat of RS as predicted by theory.

29 / 38

Page 114: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Network Traffic and Disk Read : W2 Workload, 1 PG

Network traffic reduced to 75%, 48%, 34% ofthat of RS matching the theoretical values.

Reductions same as that for W1.

Disk read for (6, 4, 5) code is optimal

For (12, 9, 11) and (20, 16, 19) codes effect offragmented read is observed.

30 / 38

Page 115: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Fragmented Read

Best and worst case, disk read during repair of(20,16,19) code for stripe sizes 1MB, 64MB

During repair of a chunk only β < α sub-chunksare read from each helper nodes.

During worst case failures, the sub-chunks neededin repair are not located contiguously.

sub-chunk size = stripe size/kα

For (20,16,19) code α = 1024, k = 16. Therefore,for stripe sizes 64MB and 1MB, the sub-chunksizes are 4KB, 64B respectively.

If sub-chunk size is aligned to 4kB (SSD pagegranularity), the fragmented-read problem can beavoided.

31 / 38

Page 116: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Repair Time and Encoding Time: W1 Workload, 1 PG

Repair time reduced by 1.49x, 2.34x, 3x of that ofRS.

The total encoding time remains almost same asthat of RS.

While, encode computation time of Clay code ishigher than that of RS code by 70%.

This is due to the additional PFT and PRToperations.

32 / 38

Page 117: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Normal and Degraded I/O : W1 workload, 1 PG

Better degraded read 16.24%, 9.9%, 27.17% and write throughput increased by 4.52%, 13.58%, 106.68%of that of RS.

Normal read and write throughput same as that of RS.

33 / 38

Page 118: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Network Traffic and Disk Read : W1 workload, 512 PG

Assignment of OSDs and objects to PGs is dynamic.

I Number of objects affected by failure of an OSD can vary across different runs of multiple-PG

experiment.

Sometimes an OSD that is already part of the PG can get reassigned as replacement for the failed OSD.

I Number of failures are treated as two resulting in inferior network-traffic performance in multiple-PG

setting.

34 / 38

Page 119: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Multiple Node Failures

Average theoretical network traffic during repair of64MB object.

Workload W1, 512 PG

Network traffic increases with increase in numberof failed chunks.

35 / 38

Page 120: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Conclusions

We provide an open-source implementation of Clay code for any (n, k , d) parameters.

The theoretical promise of the Clay code is reflected in theevaluation presented here

Specifically, for Workloads with large sized objects, the Claycode (20, 16, 19):

I resulted in repair time reduction by 3x .

I Improved degraded read and write performance by 27.17%and 106.68% respectively.

In summary, Clay Codes are well

poised to make the leap from theory to

practice!!!

36 / 38

Page 121: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Conclusions

We provide an open-source implementation of Clay code for any (n, k , d) parameters.

The theoretical promise of the Clay code is reflected in theevaluation presented here

Specifically, for Workloads with large sized objects, the Claycode (20, 16, 19):

I resulted in repair time reduction by 3x .

I Improved degraded read and write performance by 27.17%and 106.68% respectively.

In summary, Clay Codes are well

poised to make the leap from theory to

practice!!!

36 / 38

Page 122: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Conclusions

We provide an open-source implementation of Clay code for any (n, k , d) parameters.

The theoretical promise of the Clay code is reflected in theevaluation presented here

Specifically, for Workloads with large sized objects, the Claycode (20, 16, 19):

I resulted in repair time reduction by 3x .

I Improved degraded read and write performance by 27.17%and 106.68% respectively.

In summary, Clay Codes are well

poised to make the leap from theory to

practice!!!

36 / 38

Page 123: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Conclusions

We provide an open-source implementation of Clay code for any (n, k , d) parameters.

The theoretical promise of the Clay code is reflected in theevaluation presented here

Specifically, for Workloads with large sized objects, the Claycode (20, 16, 19):

I resulted in repair time reduction by 3x .

I Improved degraded read and write performance by 27.17%and 106.68% respectively.

In summary, Clay Codes are well

poised to make the leap from theory to

practice!!!

36 / 38

Page 124: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Thank You!

37 / 38

Page 125: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Backup Slides!

38 / 38

Page 126: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Decode: Two nodes fail

Page 127: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Assign Intersection Score to each plane

z = (0,0) z = (1,0)

z = (0,1) z = (1,1)

Intersection score is given by the number of hole-dot pairs

Page 128: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

Assign Intersection Score to each plane

z = (0,0) z = (1,0)

z = (0,1) z = (1,1)

IS=1

IS=0

IS=2

IS=1

Intersection score is given by the number of hole-dot pairs

Page 129: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

For non erased nodes, get the uncoupled sub-chunks for planes with IS=0

IS=1

IS=2

IS=0

IS=1

Page 130: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

RS decode to get the remaining uncoupled-subchunks

IS=1

IS=2

IS=0

IS=1

RS Decode

Page 131: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

IS=1

IS=2

IS=0

IS=1

U2*U1*

C1

C2

Known sub-chunks

We now have following sub-chunks

Page 132: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

IS=1

IS=2

IS=0

IS=1

U2*U1*

C1

C2 U2

U1

Known sub-chunks

Get U2 from U2* and C2

Get U1 from U1* and C1

For non erased nodes, get the uncoupled sub-chunks for planes with IS=1

Page 133: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

IS=1

IS=2

IS=0

IS=1

U2*U1*

C1

C2 U2

U1

RS decode to get the remaining uncoupled-subchunks

RS Decode

RS Decode

Known sub-chunks

Page 134: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

IS=1

IS=2

IS=0

IS=1

C1C2

U1*

Known sub-chunks

We now have the following sub-chunks

U2*

Page 135: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

IS=1

IS=2

IS=0

IS=1

C1C2

U1*

Known sub-chunks

U2*

Get U2 from U2* and C2

Get U1 from U1* and C1

U2

U1

For non erased nodes, get the uncoupled sub-chunks for planes with IS=2

Page 136: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

IS=1

IS=2

IS=0

IS=1

C1C2

U1*

RS Decode

Known sub-chunks

Get the uncoupled sub-chunks for planes with IS=2

U2*

Get U2 from U2* and C2

Get U1 from U1* and C1

U2

U1

Page 137: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

We now have all the uncoupled sub chunks

Page 138: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

The coupled sub chunks can now be computed using PFT

PFT

Page 139: Clay Codes: Moulding MDS Codes to Yield an MSR Codemyna/seminar/pdfs/Myna_5_12_18.pdf · 2019-03-18 · Myna Vajha ECE Student Seminar Series December 5, 2018 1/38. Team This is a

The decoding is now complete