Erasure Codes and Storage Tiers on Gluster
Dan Lambright
SA Summit, Sep 23, 2014
AGENDA
● Why erasure codes (ec) in Gluster
● How ec works
● Brief peek at underlying mathematics
● Storage tiering in gluster
● Demo
● “One more thing”
Why erasure codes in gluster?
● Desire protection from double failure
● RAID6 controllers are expensive
● Imagine a 64 node volume
  ● Each brick on a separate bare metal machine
  ● Cost is 64 x $ (LSI MegaRAID controller) = 20K
Why erasure codes in gluster?
● Triplication (3 way replication) is expensive
● Two redundant disks for every data disk
● 200% overhead! :(
Erasure codes
● Store m disks' worth of data on k disks (k > m)
● n = k - m redundant disks
● Can pick n to choose the failure tolerance
● A generalization of RAID6
● Distributed across nodes
Overhead analysis
● Can also consider mean time before failure

| k (total disks) | n (failures tolerated) | m (data disks) | Capacity overhead (n/k) | RAID level |
|---|---|---|---|---|
| 3 | 1 | 2 | 33.33% | 5 |
| 5 | 1 | 4 | 20% | 5 |
| 6 | 2 | 4 | 33.33% | 6 |
| 7 | 3 | 4 | 42.86% | E |
| 9 | 1 | 8 | 11.11% | 5 |
| 10 | 2 | 8 | 20% | 6 |
| 11 | 3 | 8 | 27.27% | E |
| 12 | 4 | 8 | 33.33% | E |
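The overhead column can be reproduced with a few lines of arithmetic. The sketch below is illustrative only; the function names are made up for this example. It also shows why triplication is quoted at 200%: that figure counts redundant disks per data disk (n/m), whereas the table divides by the total disk count (n/k).

```python
def capacity_overhead(k: int, n: int) -> float:
    """Fraction of raw capacity spent on redundancy, n / k (the table's column)."""
    if not 0 < n < k:
        raise ValueError("need 0 < n < k")
    return n / k


def redundancy_per_data_disk(k: int, n: int) -> float:
    """Redundant disks per data disk, n / m with m = k - n."""
    if not 0 < n < k:
        raise ValueError("need 0 < n < k")
    return n / (k - n)


# A few rows from the table: (k, n)
for k, n in [(3, 1), (6, 2), (7, 3), (12, 4)]:
    print(f"k={k} n={n} m={k - n} overhead={capacity_overhead(k, n):.2%}")

# 3-way replication viewed as k=3, n=2, m=1
print(f"triplication: {redundancy_per_data_disk(3, 2):.0%} extra disks per data disk")
```

Under the same n/m measure, the 6-disk, 2-failure configuration costs 50% rather than 200% while still surviving a double failure.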
ERASURE CODES PRIMER
ERASURE CODE TERMS
● m data disks
● n parity disks
● k total number of disks = m + n
● Symbol – smallest data unit, w bits
  ● Typically w = 8 (one byte)
● Chunk (aka fragment) – r symbols per disk
● Stripe – collection of m + n chunks across k disks
  ● Unit of manipulation for recovery
  ● Also known as a “slice”
ERASURE CODE TERMS
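[Figure: example stripe with r = 6, m = 4, n = 2, k = 6, w = 1. Each symbol is one bit, each fragment holds six symbols (e.g. 011010), and the “stripe” consists of 6 fragments.]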
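To make the terms concrete, the sizes implied by the figure's parameters (r = 6, m = 4, n = 2, w = 1) can be worked out as below. This is a small sketch; the class and property names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StripeGeometry:
    m: int  # data disks
    n: int  # parity disks
    r: int  # symbols per chunk (fragment)
    w: int  # bits per symbol

    @property
    def k(self) -> int:
        """Total number of disks."""
        return self.m + self.n

    @property
    def chunk_bits(self) -> int:
        """Bits stored on one disk per stripe."""
        return self.r * self.w

    @property
    def stripe_bits(self) -> int:
        """Bits stored across all k disks per stripe."""
        return self.k * self.chunk_bits

    @property
    def payload_bits(self) -> int:
        """Bits of user data carried by one stripe."""
        return self.m * self.chunk_bits


figure = StripeGeometry(m=4, n=2, r=6, w=1)
print(figure.k, figure.chunk_bits, figure.stripe_bits, figure.payload_bits)
```

For the figure this gives 6-bit fragments, a 36-bit stripe, and 24 bits of user data per stripe.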
Systematic
● m data chunks, n coding chunks
  ● (can stripe parity and data chunks on the same disk)
● Reads are simple, only decode on repairs
[Figure: slices 1 to 3 laid out across the disks]
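A minimal sketch of the systematic idea, using the simplest possible code: a single XOR parity chunk (n = 1, as in RAID5). It is not the code Gluster uses; it only illustrates that data chunks are stored verbatim, so a healthy read needs no decoding and only a repair touches the parity. Function names are hypothetical.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode_systematic(data_chunks: list[bytes]) -> list[bytes]:
    """Return the m data chunks unchanged, followed by one XOR parity chunk."""
    parity = reduce(xor_bytes, data_chunks)
    return list(data_chunks) + [parity]


def read(stripe: list[bytes], m: int) -> bytes:
    """Healthy read: concatenate the first m chunks, no decoding involved."""
    return b"".join(stripe[:m])


def repair(stripe: list[bytes], lost: int) -> bytes:
    """Rebuild the chunk at index `lost` as the XOR of all surviving chunks."""
    survivors = [c for i, c in enumerate(stripe) if i != lost]
    return reduce(xor_bytes, survivors)


stripe = encode_systematic([b"ab", b"cd", b"ef"])  # m = 3, n = 1
print(read(stripe, 3))
print(repair(stripe, 1))
```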
Non-Systematic
● All k chunks in a stripe are coded
● Do not distinguish data servers from code servers
● Encode/decode on both writes and reads
[Figure: slices 1 to 3 laid out across the disks]
Encoding / Decoding Overhead
● Network RTT dominates the encode/decode overhead
● Packages exist to implement the math
  ● Intel has fast routines for inverse, dot product, encoding, decoding, etc.
  ● Jerasure library from academia
  ● Gluster's is purpose-built and fast
GLUSTER IMPLEMENTATION
GLUSTERFS “Disperse Volumes”
● Developed by Xavier Hernandez at Datalab corp.
● Use case: archiving medical records
● Developed over the last 2 years
● Now part of gluster upstream
CLI
Two new options have been added to the 'create' command of the CLI:
gluster volume create <name> disperse <count> redundancy <count>
disperse is “k” (total number of subvolumes, i.e. bricks)
redundancy is “n”
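As a sketch of how k and n map onto the two options, the helper below assembles the argument list exactly as the slide shows it. The helper name is hypothetical, and the slide's syntax omits whatever else a full command requires (such as the list of bricks), so this is not a complete invocation.

```python
def disperse_create_command(name: str, k: int, n: int) -> list[str]:
    """Build the 'create' command from the slide: disperse = k, redundancy = n."""
    if not 0 < n < k:
        raise ValueError("need 0 < n < k (redundancy below total count)")
    return [
        "gluster", "volume", "create", name,
        "disperse", str(k),
        "redundancy", str(n),
    ]


# m = 4 data + n = 2 redundant, so k = 6
print(" ".join(disperse_create_command("archive", k=6, n=2)))
```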
“Disperse volumes” design choices
● The “symbols” are bytes: w = 8
● The fragment size r = 128
● Algorithm: Reed-Solomon
● Generator matrix: Vandermonde
● Non-systematic
● Encoding / decoding done on client side
● Modeled after AFR
● Concurrent writes must be processed in order
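The combination of Reed-Solomon, a Vandermonde generator matrix, byte symbols and non-systematic coding can be sketched in a few dozen lines. This is a toy illustration, not Gluster's implementation: the field polynomial (0x11d), the evaluation points 1..k, and the use of one symbol per fragment instead of r = 128 are all assumptions made to keep the example short.

```python
# GF(2^8) log/antilog tables, assuming the primitive polynomial 0x11d.
EXP = [0] * 512
LOG = [0] * 256
_x = 1
for _i in range(255):
    EXP[_i] = _x
    LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 512):
    EXP[_i] = EXP[_i - 255]


def gf_mul(a: int, b: int) -> int:
    return 0 if a == 0 or b == 0 else EXP[LOG[a] + LOG[b]]


def gf_inv(a: int) -> int:
    return EXP[255 - LOG[a]]  # a must be non-zero


def gf_pow(a: int, e: int) -> int:
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result


def vandermonde(k: int, m: int) -> list[list[int]]:
    """Row i is [x^0, ..., x^(m-1)] with x = i + 1; any m rows are invertible."""
    return [[gf_pow(i + 1, j) for j in range(m)] for i in range(k)]


def encode(data: list[int], k: int) -> list[int]:
    """Non-systematic: every one of the k output symbols is a coded mix of the data."""
    out = []
    for row in vandermonde(k, len(data)):
        acc = 0
        for coeff, d in zip(row, data):
            acc ^= gf_mul(coeff, d)  # addition in GF(2^8) is XOR
        out.append(acc)
    return out


def decode(fragments: dict[int, int], k: int, m: int) -> list[int]:
    """Recover the m data symbols from any m surviving fragments (index -> symbol)."""
    idx = sorted(fragments)[:m]
    v = vandermonde(k, m)
    rows = [v[i][:] + [fragments[i]] for i in idx]  # augmented matrix
    for col in range(m):  # Gauss-Jordan elimination over GF(2^8)
        pivot = next(r for r in range(col, m) if rows[r][col])
        rows[col], rows[pivot] = rows[pivot], rows[col]
        inv = gf_inv(rows[col][col])
        rows[col] = [gf_mul(inv, x) for x in rows[col]]
        for r in range(m):
            if r != col and rows[r][col]:
                f = rows[r][col]
                rows[r] = [a ^ gf_mul(f, b) for a, b in zip(rows[r], rows[col])]
    return [rows[r][m] for r in range(m)]


data = list(b"glus")              # m = 4 data symbols
coded = encode(data, k=6)         # k = 6 fragments, n = 2
survivors = {i: s for i, s in enumerate(coded) if i not in (1, 4)}
print(bytes(decode(survivors, k=6, m=4)))
```

Because no fragment holds the data verbatim, even a healthy read has to run `decode`, which is the cost the non-systematic slide refers to.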
STORAGE TIERS
Storage Tiers
● Different “subvolume” tiers presented as a single volume
● HDD, SSD, tape, “persistent memory”, etc.
● Plug-in policy describes how data moves between tiers
● V1 policy: Cache
● slow and fast tiers
● CLI to add/remove cache tier from existing volume
Example: Erasure codes + SSD
● User sees one volume
● SSD “caches” ec data
[Figure: a tiered volume made of a “cache” tier on SSD (hot data) and an ec tier on HDD (cold data), with promote and demote arrows between them]
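One way to picture a cache policy with promotion and demotion is the toy model below. Everything in it is hypothetical (the class, the access-count threshold, the notion of a "period"); the slides do not describe how the actual policy decides what is hot or cold.

```python
from collections import Counter


class CacheTierPolicy:
    """Toy hot/cold policy: promote after repeated access, demote when idle."""

    def __init__(self, promote_after: int = 3):
        self.promote_after = promote_after  # assumed threshold, not from the slides
        self.hot: set[str] = set()          # files currently on the SSD tier
        self.hits: Counter = Counter()      # accesses seen in the current period

    def access(self, name: str) -> str:
        """Record an access and return the tier that served it."""
        tier = "ssd" if name in self.hot else "hdd"
        self.hits[name] += 1
        if tier == "hdd" and self.hits[name] >= self.promote_after:
            self.hot.add(name)  # promote: later reads come from SSD
        return tier

    def end_of_period(self) -> set[str]:
        """Demote hot files that saw no access this period, then reset counters."""
        demoted = {f for f in self.hot if self.hits[f] == 0}
        self.hot -= demoted
        self.hits.clear()
        return demoted


policy = CacheTierPolicy(promote_after=2)
print([policy.access("scan.dcm") for _ in range(3)])
```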
Future : Data classification (DC)
● Add rules to storage graph
● Rule determines subvolume
  ● File name
  ● Attribute (size, content)
  ● Etc.
[Figure: decision node “Filename = *.lock ?” with a Yes branch and a No branch, leading to the “Secure / Encrypted” and “HDD” subvolumes]
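A rule that picks a subvolume by file name could look like the sketch below. The rule table, the subvolume labels and the assumption that the Yes branch leads to the secure/encrypted subvolume are illustrative guesses based on the figure, not a description of an existing Gluster feature.

```python
from fnmatch import fnmatchcase

# Ordered (pattern, subvolume) rules; first match wins. Labels are hypothetical.
RULES = [
    ("*.lock", "secure-encrypted"),
]
DEFAULT_SUBVOLUME = "hdd"


def classify(filename: str, rules=RULES, default: str = DEFAULT_SUBVOLUME) -> str:
    """Return the subvolume chosen for a file by the first matching rule."""
    for pattern, subvolume in rules:
        if fnmatchcase(filename, pattern):
            return subvolume
    return default


print(classify("patient-db.lock"))
print(classify("notes.txt"))
```

Rules on other attributes such as size or content would slot into the same first-match loop as additional predicates.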
Future flexibility
● Many use cases
  ● Compliance
  ● Multi-tenancy
  ● Rack-aware placement (for performance)
● Policies described by language
  ● Arbitrary number of tiers, rules, subvolumes ..
  ● Template based
DEMO
ONE MORE THING..
Bitrot
● A daemon that scans gluster volumes
  ● Finds corrupted data
  ● Digest associated with each file
  ● Alert / recover on mismatch
● “Plug-ins” to the daemon may do other things..
  ● Encryption
  ● Compression
  ● Etc.
● Tuning parameters to be non-intrusive to performance
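The digest-and-compare idea can be sketched as follows. This is a simplified illustration rather than the daemon's design: SHA-256, the in-memory `known` dictionary and the function names are assumptions, and a real scrubber would also have to tell legitimate writes apart from silent corruption.

```python
import hashlib
from pathlib import Path


def digest(path: Path) -> str:
    """SHA-256 of a file, read in 1 MiB blocks to bound memory use."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for block in iter(lambda: f.read(1 << 20), b""):
            h.update(block)
    return h.hexdigest()


def scan(root: Path, known: dict[str, str]) -> list[str]:
    """Return files whose digest no longer matches; record digests of new files."""
    mismatched = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        key = str(path.relative_to(root))
        current = digest(path)
        if key not in known:
            known[key] = current      # first sighting: remember the digest
        elif known[key] != current:
            mismatched.append(key)    # alert / recover would happen here
    return mismatched
```

A caller would keep `known` in persistent storage and run `scan` periodically, throttled so that it does not compete with client I/O.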
Do it!
● Learn the math:
  ● http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-Tutorial.html
● Get the bits:
  ● https://forge.gluster.org/disperse
Thank You!
● RHS: www.redhat.com/storage/
● GlusterFS: www.gluster.org
● @Glusterorg, @RedHatStorage
Gluster
Red Hat Storage
Slides Available on Mojo