CS 443 Advanced OS
Fabián E. Bustamante, Spring 2005
Glacier: Highly Durable, Decentralized Storage Despite Massive Correlated
Failures
Andreas Haeberlen, Alan Mislove, and Peter Druschel
Presenter: Yi Qiao
2
Outline
Introduction
Related Work
Assumptions and Intended Environment
Glacier
Object Aggregation
Security
Evaluation
Conclusion
3
Introduction
How to achieve high availability in decentralized storage systems?
Replication
Problems
– Failures are not independent
– Worms make the problem worse
– Losing even some data can have catastrophic effects
Glacier
– A distributed storage system that is robust to large-scale correlated failures
• Highly durable, decentralized storage
– Trades storage efficiency for durability
– Makes no assumptions about the nature and correlation of failures
– Aggregates small objects and uses a fragment maintenance protocol to reduce the message overhead
4
Related Work
OceanStore and Phoenix
– Apply introspection to defend against correlated failures
• Difficult to capture all correlations
• Introspection itself can make the system vulnerable to attacks
– Glacier instead relies on minimal assumptions about the nature of failures, at the cost of a larger storage overhead
TotalRecall
– Optimizes availability under churn, but offers no worst-case guarantees
PAST, Farsite
– Use replication to protect against data loss
Weatherspoon et al.
– Erasure codes can achieve better MTTF than plain replication
5
Assumptions and Intended Environment
Intended for an environment of desktop computers within an organizational intranet
– Some fraction of the nodes can be home desktops connected via DSL or wireless LAN
– Modest amount of churn and good network connectivity
– Used in combination with a conventional decentralized replicated storage layer
Lifetime – hundreds of days; session time – days or hours
Three modes of operation
– Normal operation
– Large-scale failure – up to a fraction fmax of the nodes fail
• Goal: protect the data stored on non-faulty nodes
– Recovery mode
• Reconstitutes aggregates and restores missing fragments
6
Glacier
Participating storage nodes form an overlay network
– The set of keys forms a circular space
– Each node stores objects whose keys fall into its own key segment
– Uses the underlying DHT layer for secure routing and communication
Operates alongside a primary store that holds full replicas
Aggregation of small objects
Erasure coding of aggregates
Fragment placement at randomly selected nodes
7
Glacier
Durability guarantee – as long as f <= fmax
– Each object survives with probability p >= pmin
Application Interface (see the sketch below)
– put(i, v, o, l) – insert object o under identifier i with version v and lease l
– get(i, v) → o
– refresh(i, v, l) – extend the lease
– No primitives for deletion or overwriting
• Leases are used instead and can be renewed when necessary
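An illustration only – a minimal sketch of how an application might drive this interface; the GlacierClient wrapper, parameter types, and identifiers are assumptions of this example, not the paper's API:

class GlacierClient:
    """Hypothetical wrapper around Glacier's put/get/refresh primitives."""
    def put(self, i, v, o, l):
        # Store object o under identifier i and version v, with lease time l.
        ...
    def get(self, i, v):
        # Return the object stored under (i, v), or None if unavailable.
        ...
    def refresh(self, i, v, l):
        # Extend the lease of (i, v); there is no delete or overwrite primitive.
        ...

client = GlacierClient()
client.put(i="inbox-0042", v=1, o=b"serialized email", l=30 * 24 * 3600)
restored = client.get(i="inbox-0042", v=1)
client.refresh(i="inbox-0042", v=1, l=30 * 24 * 3600)  # renew before expiry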
8
Glacier
Fragments and manifests
– Erasure code – reduces storage overhead
• An object O of size |O| is stored as n fragments F1, F2, ..., Fn of size |O|/r; any r of them are sufficient to restore the object
• Object key k; fragment Fi has key (k, i, v)
– Object authenticator and manifest (sketch below)
• AO = (H(O), H(F1), H(F2), ..., H(Fn), v, l)
• Corrupted fragments can be detected and removed
Key ownership
– Keys are assigned by consistent hashing over the set of nodes that are either online or were online within the last period Tmax
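A small sketch of how such a manifest could be built and used to validate individual fragments, assuming SHA-1 hashing via Python's hashlib and a dictionary layout of my own choosing; the erasure-coding and signing steps are left abstract:

import hashlib

def h(data: bytes) -> bytes:
    # Secure hash used for the object and fragment digests.
    return hashlib.sha1(data).digest()

def make_manifest(obj: bytes, fragments: list[bytes], v: int, l: int) -> dict:
    # AO = (H(O), H(F1), ..., H(Fn), v, l); signed by the owner in the real system.
    return {"h_obj": h(obj), "h_frags": [h(f) for f in fragments], "v": v, "l": l}

def fragment_is_valid(manifest: dict, i: int, fragment: bytes) -> bool:
    # A corrupted fragment fails this check and can be discarded and rebuilt.
    return manifest["h_frags"][i] == h(fragment)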
9
Glacier
Fragment placement
– Fragments of the same object are placed on different, randomly chosen nodes
– Fragments of objects with similar keys should be grouped together
– The placement function should be stable
– P(k, i, v) = k + i/(n+1) + H(v) (sketch below)
• Primary replica – position k
• n fragments – together with the primary replica, n+1 equidistant points in the circular space
• H(v) – prevents load imbalance
– When inserting a new object (k, v), if the owner of P(k, i, v) is offline, the fragment is discarded and restored later
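A sketch of this placement rule on a key space normalized to [0, 1); the concrete hash used for H(v) and the modular wrap-around are assumptions of this illustration:

import hashlib

def h_version(v: int) -> float:
    # Map the version number into [0, 1) so different versions land elsewhere.
    digest = hashlib.sha1(str(v).encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64

def placement(k: float, i: int, v: int, n: int) -> float:
    # P(k, i, v) = k + i/(n+1) + H(v), wrapped around the circular key space.
    return (k + i / (n + 1) + h_version(v)) % 1.0

# Fragment positions for an object with key k = 0.25, version 7, and n = 4.
points = [placement(0.25, i, 7, n=4) for i in range(1, 5)]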
10
Glacier
Fragment maintenance
– Fragment insertion misses, key-space ownership changes, and failures can cause fragments to be lost
– A simple protocol (sketch below)
• The node compiles a list of all keys (k, v) in its local fragment store and sends the list to some of its peers
• Each peer replies with a list of manifests for the objects missing from that list
• The node then requests r fragments from its peers, validates them, and computes the fragment to store locally
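An illustrative sketch of one maintenance round; the peer proxies and the validate/decode/encode helpers are all hypothetical stand-ins passed in as parameters, not the paper's implementation:

def maintenance_round(local_store, peers, r, validate, decode, make_fragment):
    # local_store maps (key, version) -> locally held fragment.
    inventory = set(local_store)
    for peer in peers:
        # Steps 1-2: advertise our inventory; the peer replies with manifests
        # of objects it believes we are missing.
        for manifest in peer.compare_inventory(inventory):
            key = (manifest["key"], manifest["v"])
            frags = []
            # Step 3: gather r fragments, validate each against the manifest,
            # rebuild the object, and derive the fragment we should hold.
            for p in peers:
                f = p.get_fragment(*key)
                if f is not None and validate(manifest, f):
                    frags.append(f)
                if len(frags) == r:
                    local_store[key] = make_fragment(decode(frags, r), manifest)
                    break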
11
Glacier
Recovery
– No need for Glacier to explicitly detect failures
• Compromised nodes either
– Fail permanently – other nodes take over their key segments
– Are repaired and rejoin the system with an empty fragment store
– The number of simultaneous fragment reconstructions is limited to a fixed number to avoid congestive collapse
Garbage Collection (sketch below)
– Happens when a lease expires
– Can be carried out independently by each storage node
– A grace period TG accounts for the maximal clock difference between nodes
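A small sketch of the per-node garbage-collection check; the grace period value and the fragment-store layout are illustrative assumptions:

import time

GRACE_PERIOD_TG = 24 * 3600  # assumed bound on inter-node clock difference

def collect_garbage(fragment_store, now=None):
    # Runs independently on each node: a fragment is dropped only once its
    # lease has been expired for at least the grace period TG.
    now = time.time() if now is None else now
    for key, entry in list(fragment_store.items()):
        if entry["lease_expiry"] + GRACE_PERIOD_TG < now:
            del fragment_store[key]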
12
Glacier
Configuration
– An object can be reconstructed if r out of its N fragments can be obtained
– Choose N and r so that the survival probability P meets the desired durability (see the check below)
– Still offers some protection even when fmax is chosen too low
– Lease times must be longer than the maximal duration of a large-scale failure – on the order of months
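One way to read this: assuming each fragment independently lands on a node that survives a failure of a fraction fmax of all nodes, the object survives if at least r of its N fragments survive. A back-of-the-envelope binomial check of that assumption (an approximation, not the paper's exact analysis):

from math import comb

def survival_probability(N, r, f_max):
    # Probability that at least r of the N fragments sit on surviving nodes,
    # assuming each fragment survives independently with probability 1 - f_max.
    p = 1.0 - f_max
    return sum(comb(N, i) * p**i * (1 - p)**(N - i) for i in range(r, N + 1))

# With the ePOST parameters quoted later (N=48, r=5, fmax=60%), this comes
# out at roughly 0.999999, matching the pmin used in that deployment.
print(survival_probability(48, 5, 0.60))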
13
Glacier – Object Aggregation
Massive redundancy – a substantially larger number of internal objects than application objects
Aggregation of small application objects reduces the cost of fragment creation and maintenance – aggregates store tuples (oi, ki, vi)
Aggregation is performed on a per-user basis
– Simple, but loses the opportunity to bundle objects from different users
Local aggregate directory (sketch below)
– Aggregates link to one another, forming a directed acyclic graph
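A toy sketch of per-user aggregation: small objects are bundled into an aggregate that records (oi, ki, vi) tuples, and the new aggregate can link to earlier ones, forming the DAG mentioned above. The dictionary layout and names are illustrative only:

def build_aggregate(small_objects, parent_aggregate_keys):
    # small_objects: list of (object_bytes, key, version) tuples for one user.
    return {
        "entries": [{"o": o, "k": k, "v": v} for (o, k, v) in small_objects],
        # Links to earlier aggregates turn the local directory into a DAG.
        "links": list(parent_aggregate_keys),
    }

aggregate = build_aggregate(
    small_objects=[(b"msg-1", "k1", 1), (b"msg-2", "k2", 1)],
    parent_aggregate_keys=["agg-000"],
)
# The aggregate, rather than each small object, is what gets erasure-coded.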
14
Glacier – Object Aggregation
Recovery
– Both the primary-store data and the aggregate directory could be lost after a correlated failure
– The aggregate directory is recovered by walking the DAG of aggregates
Consolidation (sketch below)
– Periodically check the aggregate directory for aggregates whose leases will expire soon
• The lease is not renewed if many of the aggregate's objects have expired leases
– Non-expired objects are consolidated with newly inserted objects into a new aggregate
– Particularly effective when object lifetimes are bimodal
• The consolidated aggregate then contains mostly long-lived objects
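A sketch of one consolidation pass over a local aggregate directory; the expiry-fraction threshold, the "soon" window, and the directory layout are assumptions of this example:

import time

def consolidate(directory, soon=7 * 24 * 3600, expired_threshold=0.5, now=None):
    # directory: list of aggregates, each carrying its own lease_expiry and
    # the per-object lease expiries of its contents.
    now = time.time() if now is None else now
    carried_over = []
    for agg in directory:
        if agg["lease_expiry"] - now > soon or not agg["objects"]:
            continue  # lease not expiring soon (or empty aggregate): leave it
        expired = sum(o["lease_expiry"] <= now for o in agg["objects"])
        if expired / len(agg["objects"]) >= expired_threshold:
            # Do not renew; carry the still-live objects into a new aggregate.
            carried_over += [o for o in agg["objects"] if o["lease_expiry"] > now]
    return carried_over  # to be combined with newly inserted objects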
15
Glacier - Security
Potential attacks against either the durability or the integrity of data stored in Glacier
– Attacks on integrity
– Attacks on durability
– Attacks on the time source
– Space-filling attacks
– Attacks on Glacier itself
– Haystack-needle attacks
16
Evaluation
Glacier prototype
– Built on top of the FreePastry implementation of the Pastry structured overlay
– Uses PAST as its primary store
Two sets of experiments
– ePOST
• A cooperative, server-less email system for a small group of users
• Glacier used as the storage layer
– Trace-driven simulations
• A much larger workload with 147 users and up to 1,000 nodes
17
Evaluation
ePOST experiments
– 20 to 30 nodes, mostly desktop PCs running Linux
– 8 passive users and 9 active users
– Uses Glacier to store email and the corresponding metadata
• N=48, r=5, fmax=60%, pmin=0.999999
• Experiment too small to guarantee uncorrelated fragment losses
– Glacier was able to handle all the failures encountered during the development and testing of ePOST
18
Evaluation
ePOST workload
– Cumulative size of inserted objects over time
• Live – objects that have not yet expired
– Histogram of object sizes
• Bimodal
– A large number of objects between 1 and 10 KB
» Justifies aggregation
– A small number of large objects
» Usually attachments
19
Evaluation
ePOST storage
– Amount of storage required by Glacier for the workload
• Grows slowly as new emails enter the system
• The XML data structures create an additional 32% overhead
– On-disk data structures vs. actual email payload
• Storage overhead close to 9.6 × 1.32 ≈ 12.7 (the erasure-coding factor N/r = 48/5 = 9.6, inflated by the 32% XML overhead)
20
Evaluation
ePOST traffic
– Five categories
• Insertion, refresh, maintenance, handoff, and lookup
– In periods with few failures, traffic is dominated by insertions and refreshes
– During unstable periods, handoff and maintenance traffic increases
21
Evaluation
ePOST aggregation
– Compare the number of objects with the number of aggregates in the system
• Aggregation reduces the number of keys by over an order of magnitude
• Low number of expired objects – effective aggregate consolidation
22
Evaluation
Simulation Study – Diurnal Behavior
– Glacier and the aggregation layer implemented in the simulator
– Trace from a department email server
– Diurnal behavior affects the message overhead
• Higher churn
– Less insertion traffic
– More maintenance messages to recover lost fragments
23
Evaluation
Simulation Study – Load
– How load influences the message overhead
• Under light load, message overhead remains roughly constant
– Aggregation is performed periodically by every node
• Under higher load, overhead increases roughly linearly
24
Evaluation
Simulation Study – Scalability
– Increase the overlay size and study the per-node traffic overhead
• Remains approximately constant
• Grows only slowly, since messages are routed through Pastry
25
Conclusions
Ensures durability of unrecoverable data in a cooperative, decentralized storage system
Robust to large-scale, correlated, Byzantine failures of storage nodes
No introspection
Massive redundancy to mask the effects of correlated failures
Erasure codes and garbage collection to reduce storage cost
Aggregation and a fragment maintenance protocol to reduce message costs