Erasure Codes and Storage Tiers on Gluster
Dan Lambright
SA summit, Sep 23, 2014
AGENDA
● Why erasure codes (ec) in Gluster
● How ec works
● Brief peek at underlying mathematics
● Storage tiering in Gluster
● Demo
● “One more thing”
Why erasure codes in gluster?
● Desire protection from double failure
● RAID6 controllers are expensive
● Imagine a 64 node volume
● Each brick on a separate bare metal machine
● Cost is 64 x the price of an LSI MegaRAID controller = $20K
Why erasure codes in gluster?
● Triplication (3 way replication) is expensive
● Two redundant disks for every data disk
● 200% overhead! :(
Erasure codes
● Store m disks worth of data on k disks (k>m)
● n redundant disks (n = k-m)
● Can pick n to choose failure tolerance
● A generalization of RAID6
● Distributed across nodes
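The simplest instance of the idea is n = 1, where the single redundant chunk is the XOR of the data chunks (the RAID5 case). The sketch below is only an illustration of that special case, not Gluster's code; the function names and sample data are made up for the example.

```python
from functools import reduce

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length chunks."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode_single_parity(data_chunks):
    """Return the m data chunks followed by one XOR parity chunk (n = 1)."""
    parity = reduce(xor_bytes, data_chunks)
    return list(data_chunks) + [parity]

def recover_missing(chunks):
    """Rebuild the single missing chunk (marked None) from the survivors."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) != 1:
        raise ValueError("single parity tolerates exactly one lost chunk")
    survivors = [c for c in chunks if c is not None]
    repaired = list(chunks)
    # XOR of all surviving chunks equals the lost chunk.
    repaired[missing[0]] = reduce(xor_bytes, survivors)
    return repaired

stored = encode_single_parity([b"glus", b"ter ", b"ec!!"])  # m = 3, k = 4
damaged = list(stored)
damaged[1] = None                                           # lose one disk
assert recover_missing(damaged) == stored
```

Tolerating n > 1 failures needs more than XOR, which is where the Reed-Solomon style codes discussed later come in.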
Overhead analysis
● Can also consider mean time before failure
k (total disks)   n (failures admitted)   m (data disks)   Capacity overhead (n/k)   RAID level
3                 1                       2                33.33%                    5
5                 1                       4                20%                       5
6                 2                       4                33.33%                    6
7                 3                       4                42.86%                    E
9                 1                       8                11.11%                    5
10                2                       8                20%                       6
11                3                       8                27.27%                    E
12                4                       8                33.33%                    E
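The overhead column can be reproduced with a few lines. This is a small sketch with hypothetical function names. It also shows why triplication is quoted as 200% earlier: that figure is measured against the data disks (n/m), while this table measures against all disks (n/k).

```python
def capacity_overhead(k: int, n: int) -> float:
    """Fraction of raw capacity spent on redundancy: n / k."""
    if not 0 < n < k:
        raise ValueError("need 0 < n < k")
    return n / k

def overhead_vs_data(k: int, n: int) -> float:
    """Redundant disks per data disk: n / m, with m = k - n."""
    if not 0 < n < k:
        raise ValueError("need 0 < n < k")
    return n / (k - n)

for k, n in [(3, 1), (5, 1), (6, 2), (7, 3), (9, 1), (10, 2), (11, 3), (12, 4)]:
    print(f"k={k:2d} n={n} m={k - n} overhead={capacity_overhead(k, n):.2%}")

# 3-way replication: k = 3 copies, n = 2 of them redundant, m = 1 data disk.
print(f"triplication: n/k={capacity_overhead(3, 2):.2%}, n/m={overhead_vs_data(3, 2):.0%}")
```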
ERASURE CODES PRIMER
ERASURE CODE TERMS
● m data disks
● n parity disks
● k total number disks = m+n
● Symbol – Smallest data unit. w bits.
● Typically w = 8 (one byte)
● Chunk (aka fragment) – r symbols per disk
● Stripe – collection of m+n chunks across k disks
● Unit of manipulation for recovery
● Also known as a “slice”
ERASURE CODE TERMS
● Example: r = 6, m = 4, n = 2, k = 6, w = 1
[Figure: a “stripe” of 6 fragments. Each fragment holds r = 6 one-bit symbols, e.g. 011010.]
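To make the terms concrete, here is a minimal sketch that cuts a byte buffer into stripes of m data chunks, each chunk holding r one-byte symbols (w = 8). The function name and the zero padding of the last stripe are assumptions made for the illustration; the n coding chunks of each stripe would be computed by the erasure code and are not shown.

```python
def split_into_stripes(data: bytes, m: int, r: int):
    """Split data into stripes of m data chunks, each chunk r one-byte symbols.

    The final stripe is zero-padded (an assumption made for illustration).
    """
    stripe_bytes = m * r
    padded_len = -(-len(data) // stripe_bytes) * stripe_bytes  # round up
    data = data.ljust(padded_len, b"\0")
    stripes = []
    for off in range(0, padded_len, stripe_bytes):
        stripe = data[off:off + stripe_bytes]
        # Chunk i of the stripe is the i-th run of r symbols.
        stripes.append([stripe[i * r:(i + 1) * r] for i in range(m)])
    return stripes

stripes = split_into_stripes(b"abcdefghijklmnopqrstuvwxyz", m=4, r=6)
for number, chunks in enumerate(stripes, start=1):
    print("stripe", number, chunks)
```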
Systematic
● m data chunks, n coding chunks
● (can stripe parity and data chunks on the same disk)
● Reads are simple, only decode on repairs
[Figure: Slice 1, Slice 2, Slice 3]
Non-Systematic
● All k chunks in a stripe are coded
● Do not distinguish data servers from code servers
● Encode/decode on writes and reads
[Figure: Slice 1, Slice 2, Slice 3]
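A rough sketch of non-systematic coding: every fragment is a Vandermonde-weighted mix of all m data symbols, so even a normal read must solve a small linear system, and any m of the k fragments are enough. To keep the arithmetic readable this sketch works modulo the prime 257, which is an assumption for illustration only; byte-oriented implementations use the field GF(2^8) instead so that fragments still fit in bytes. All names here are hypothetical.

```python
P = 257  # small prime field, chosen only to keep the arithmetic readable

def vandermonde(k, m):
    """k x m matrix with rows [1, x, x^2, ...] for the distinct points x = 1..k."""
    return [[pow(x, j, P) for j in range(m)] for x in range(1, k + 1)]

def encode(data, k):
    """Non-systematic encode: each fragment mixes all m data symbols."""
    m = len(data)
    return [sum(a * d for a, d in zip(row, data)) % P for row in vandermonde(k, m)]

def decode(fragments, m, k):
    """Recover the m symbols from any m surviving fragments ({index: value})."""
    idx = sorted(fragments)[:m]
    V = vandermonde(k, m)
    # Augmented matrix: surviving rows of V next to their fragment values.
    A = [V[i][:] + [fragments[i]] for i in idx]
    for col in range(m):                       # Gauss-Jordan elimination mod P
        piv = next(r for r in range(col, m) if A[r][col])
        A[col], A[piv] = A[piv], A[col]
        inv = pow(A[col][col], -1, P)          # modular inverse (Python 3.8+)
        A[col] = [v * inv % P for v in A[col]]
        for r in range(m):
            if r != col and A[r][col]:
                f = A[r][col]
                A[r] = [(a - f * b) % P for a, b in zip(A[r], A[col])]
    return [A[r][m] for r in range(m)]

data = [103, 108, 117, 115]                    # m = 4 symbols
frags = encode(data, k=6)                      # k = 6 fragments, n = 2
survivors = {i: v for i, v in enumerate(frags) if i not in (0, 3)}  # lose two
assert decode(survivors, m=4, k=6) == data
```

In a systematic code the first m fragments would be the data itself, so the decode step would only run during repair.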
Encoding / Decoding Overhead
● Network RTT dominates the encode/decode overhead
● Packages exist to implement the math
● Intel has fast routines for inverse, dot product, encoding, decoding, etc.
● Jerasure library from academia
● Gluster's is purpose-built and fast
GLUSTER IMPLEMENTATION
GLUSTERFS “Disperse Volumes”
● Developed at Datalab corp. by Xavier Hernandez
● Use case: archiving medical records
● Developed over the last 2 years
● Now part of Gluster upstream
CLI
Two new options have been added to the 'create' command of the CLI:
gluster volume create <name> disperse <count> redundancy <count>
Disperse is “k” (total number of bricks)
Redundancy is “n” (number of failures admitted)
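As an illustration, the helper below assembles such a command line for k = 6, n = 2 without running it. The volume name, host names and brick paths are hypothetical, and appending the brick list after the two options is an assumption about the full syntax, since the slide shows only the options themselves.

```python
def disperse_create_argv(name, disperse, redundancy, bricks):
    """Build (but do not run) the argv for creating a disperse volume."""
    if len(bricks) != disperse:
        raise ValueError("this sketch expects exactly one brick per disperse count")
    if not 0 < redundancy < disperse:
        raise ValueError("redundancy must satisfy 0 < n < k")
    return ["gluster", "volume", "create", name,
            "disperse", str(disperse), "redundancy", str(redundancy), *bricks]

# Hypothetical servers and brick paths: k = 6 bricks, n = 2, so m = 4 data bricks.
bricks = [f"server{i}:/bricks/ec" for i in range(1, 7)]
argv = disperse_create_argv("ecvol", disperse=6, redundancy=2, bricks=bricks)
print(" ".join(argv))
```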
“Disperse volumes” design choices
● The “symbols” are bytes: w = 8
● The fragment size r = 128
● Algorithm: Reed-Solomon
● Generator matrix: Vandermonde
● Non-systematic
● Encoding / decoding done on client side
● Modeled after AFR (automatic file replication)
● Concurrent writes must be processed in order
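With w = 8 every symbol is a byte and the arithmetic happens in the finite field GF(2^8), where addition is XOR and multiplication is reduced by a fixed polynomial. The sketch below assumes the commonly used polynomial 0x11D; the slides do not say which polynomial Gluster uses, and the function names are hypothetical. `coded_byte` shows how one byte of a fragment would be formed from a Vandermonde row.

```python
POLY = 0x11D  # assumed reduction polynomial: x^8 + x^4 + x^3 + x^2 + 1

def gf_mul(a: int, b: int) -> int:
    """Multiply two bytes in GF(2^8): shift-and-XOR, reducing by POLY."""
    result = 0
    while b:
        if b & 1:
            result ^= a          # field addition is XOR
        a <<= 1
        if a & 0x100:
            a ^= POLY            # keep a within 8 bits
        b >>= 1
    return result

def gf_pow(a: int, e: int) -> int:
    """Raise a to the e-th power by repeated multiplication."""
    result = 1
    for _ in range(e):
        result = gf_mul(result, a)
    return result

def gf_inv(a: int) -> int:
    """Inverse via a^254, since a^255 = 1 for every non-zero a."""
    if a == 0:
        raise ZeroDivisionError("0 has no inverse in GF(2^8)")
    return gf_pow(a, 254)

def coded_byte(x: int, data) -> int:
    """Dot product of the Vandermonde row [1, x, x^2, ...] with m data bytes."""
    out = 0
    for j, d in enumerate(data):
        out ^= gf_mul(gf_pow(x, j), d)
    return out

assert all(gf_mul(a, gf_inv(a)) == 1 for a in range(1, 256))
```

Decoding is then the same linear solve as in any field, only with these operations in place of ordinary addition and multiplication.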
STORAGE TIERS
Storage Tiers
● Different “subvolume” tiers presented as a single volume
● HDD, SSD, tape, “persistent memory”, etc.
● Plug-in policy describes how data moves between tiers
● V1 policy: Cache
● slow and fast tiers
● CLI to add/remove cache tier from existing volume
Example: Erasure codes + SSD
● User sees one volume
● SSD “caches” ec data
[Figure: a tiered volume with a “cache” on SSD (hot) and ec on HDD (cold); data is promoted to the cache and demoted back to HDD.]
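A toy sketch of what a cache policy with promote and demote could look like. Everything here is hypothetical (class name, hit-count threshold, idle timeout); the slides do not describe the actual policy logic, only that hot data sits on the fast tier and cold data on the slow one.

```python
class CacheTierPolicy:
    """Toy promote/demote policy: hot files live on the fast tier."""

    def __init__(self, promote_hits=3, demote_idle=60.0):
        self.promote_hits = promote_hits   # hypothetical: accesses before promotion
        self.demote_idle = demote_idle     # hypothetical: idle seconds before demotion
        self.hits = {}
        self.last_access = {}
        self.fast_tier = set()

    def record_access(self, path, now):
        """Count an access; return "promote" when the file becomes hot."""
        self.hits[path] = self.hits.get(path, 0) + 1
        self.last_access[path] = now
        if path not in self.fast_tier and self.hits[path] >= self.promote_hits:
            self.fast_tier.add(path)       # cold (ec on HDD) -> hot (SSD)
            return "promote"
        return None

    def sweep(self, now):
        """Demote files idle longer than demote_idle; return their paths."""
        demoted = sorted(p for p in self.fast_tier
                         if now - self.last_access[p] > self.demote_idle)
        for p in demoted:
            self.fast_tier.discard(p)      # hot (SSD) -> cold (ec on HDD)
            self.hits[p] = 0
        return demoted

policy = CacheTierPolicy(promote_hits=2, demote_idle=10.0)
policy.record_access("/vol/report.dat", now=0.0)
print(policy.record_access("/vol/report.dat", now=1.0))
print(policy.sweep(now=30.0))
```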
Future : Data classification (DC)
● Add rules to storage graph
● Rule determines subvolume
● File name
● Attribute (size, content)
● Etc.
[Figure: example rule. Filename = *.lock? Yes: Secure / Encrypted. No: HDD.]
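The file-name rule in the example could be expressed as a small ordered rule table. This is only a sketch of the idea: the subvolume labels and function name are made up, and reading the diagram as "match goes to the secure/encrypted subvolume, everything else to HDD" is an assumption.

```python
from fnmatch import fnmatchcase

# Ordered (pattern, subvolume) rules; the first match wins.
RULES = [
    ("*.lock", "secure-encrypted"),   # the example rule from the slide
]
DEFAULT_SUBVOLUME = "hdd"             # where unmatched files land

def classify(filename, rules=RULES, default=DEFAULT_SUBVOLUME):
    """Return the subvolume chosen for a file name by the first matching rule."""
    for pattern, subvolume in rules:
        if fnmatchcase(filename, pattern):
            return subvolume
    return default

for name in ("db.lock", "notes.txt"):
    print(name, "->", classify(name))
```

Rules on attributes such as size or content would follow the same shape, with a predicate in place of the glob pattern.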
Future flexibility
● Many use cases
● Compliance
● Multi-tenancy
● Rack-aware placement (for performance)
● Policies described by language
● Arbitrary number of tiers, rules, subvolumes ..
● Template based
DEMO
ONE MORE THING..
Bitrot
● A daemon that scans gluster volumes
● Finds corrupted data
● Digest associated with each file
● Alert / recover on mismatch
● “Plug-ins” to daemon may do other things..
● Tuning parameters to be non-intrusive to performance
● Encryption
● Compression
● Etc.
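The digest check can be sketched in a few lines. This is an in-memory illustration only: SHA-256 is an assumed digest (the slides do not name one), the function names are hypothetical, and a real daemon would read file contents from the bricks and keep the digests alongside each file.

```python
import hashlib

def digest(data: bytes) -> str:
    """Digest recorded for a file's contents (SHA-256 assumed here)."""
    return hashlib.sha256(data).hexdigest()

def scrub(files, stored_digests):
    """Return the names of files whose contents no longer match their digest."""
    return sorted(name for name, data in files.items()
                  if name in stored_digests and stored_digests[name] != digest(data))

files = {"a.bin": b"hello", "b.bin": b"world"}
stored = {name: digest(data) for name, data in files.items()}

files["b.bin"] = b"w0rld"            # simulate silent corruption
for name in scrub(files, stored):
    print("digest mismatch:", name)  # alert / recover would happen here
```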
Do it!
● Learn the math: http://web.eecs.utk.edu/~plank/plank/papers/FAST-2013-Tutorial.html
● Get the bits: https://forge.gluster.org/disperse
Thank You!
● RHS: www.redhat.com/storage/
● GlusterFS: www.gluster.org
● @Glusterorg, @RedHatStorage
● Gluster, Red Hat Storage
Slides Available on Mojo