NFSv4 Replication for Grid Computing
DESCRIPTION
We develop a consistent mutable replication extension for NFSv4 tuned to meet the rigorous demands of large-scale data sharing in global collaborations. The system uses a hierarchical replication control protocol that dynamically elects a primary server at various granularities. Experimental evaluation indicates a substantial performance advantage over a single server system. With the introduction of the hierarchical replication control, the overhead of replication is negligible even when applications mostly write and replication servers are widely distributed.
TRANSCRIPT
NFSv4 Replication for Grid Computing
Peter Honeyman
Center for Information Technology Integration
University of Michigan, Ann Arbor
Acknowledgements
Joint work with Jiaying Zhang
  UM CSE doctoral candidate
  Defending later this month
Partially supported by
  NSF/NMI GridNFS
  DOE/SciDAC Petascale Data Storage Institute
  Network Appliance, Inc.
  IBM ARC
Outline
Background
Consistent replication
Fine-grained replication control
Hierarchical replication control
Evaluation
Durability revisited (NEW!)
Conclusion
Grid computing
Emerging global scientific collaborations require access to widely distributed data that is reliable, efficient, and convenient
GridFTP
Advantages
  Automatic negotiation of TCP options
  Parallel data transfer
  Integrated Grid security
  Easy to install and support across a broad range of platforms
Drawbacks
  Data sharing requires manual synchronization
NFSv4
Advantages
  Traditional, well-understood file system semantics
  Supports multiple security mechanisms
  Close-to-open consistency
    Reader is guaranteed to see data written by the last writer to close the file
Drawbacks
  Wide-area performance
NFSv4.r
Research prototype developed at CITI
Replicated file system built on NFSv4
Server-to-server replication control protocol
High performance data access
Conventional file system semantics
Replication in practice
Read-only replication
  Clumsy manual release model
  Lacks complex data sharing (concurrent writes)
Optimistic replication
  Inconsistent consistency
Consistent replication
Problem: state of the practice in file system replication does not satisfy the requirements of global scientific collaborations
How can we provide Grid applications efficient and reliable data access?
Design principles
Optimal read-only behavior
  Performance must be identical to un-replicated local system
Concurrent write behavior
  Ordered writes, i.e., one-copy serializability
  Close-to-open semantics
Fine-grained replication control
  The granularity of replication control is a single file or directory
Replication control
[Figure: client opens a file]
When a client opens a file for writing, the selected server temporarily becomes the primary for that file
Other replication servers are instructed to forward client requests for that file to the primary if concurrent writes occur
Replication control
[Figure: client writes the file]
The primary server asynchronously distributes updates to other servers during file modification
Replication control
[Figure: client closes the file]
When the file is closed and all replication servers are synchronized, the primary server notifies the other replication servers that it is no longer the primary server for the file
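The open, write, and close steps on the three "Replication control" slides can be sketched as a toy in-memory model. Everything named here (`ReplicaServer`, `make_cluster`, the method names) is hypothetical and not taken from NFSv4.r; the sketch also skips the voting step and pushes updates inline rather than asynchronously.

```python
class ReplicaServer:
    """Toy replication server. All names are illustrative, not NFSv4.r code."""

    def __init__(self, name):
        self.name = name
        self.peers = []        # other ReplicaServer objects in the group
        self.primary_for = {}  # path -> server currently acting as primary
        self.data = {}         # path -> contents held by this replica

    def open_for_write(self, path):
        """Return the server that will handle writes to `path`."""
        primary = self.primary_for.get(path)
        if primary is None:
            # No primary yet: this server takes the role and tells its peers,
            # which will forward later requests for the file to it.
            for server in [self] + self.peers:
                server.primary_for[path] = self
            return self
        return primary  # concurrent writer: forward to the existing primary

    def write(self, path, payload):
        """Apply a write at the primary and push it to the other replicas."""
        assert self.primary_for.get(path) is self, "only the primary accepts writes"
        self.data[path] = payload
        for peer in self.peers:
            # Asynchronous on the slides; done inline to keep the sketch short.
            peer.data[path] = payload

    def close(self, path):
        """Give up the primary role once every replica holds the same contents."""
        if all(peer.data.get(path) == self.data.get(path) for peer in self.peers):
            for server in [self] + self.peers:
                server.primary_for.pop(path, None)
            return True
        return False


def make_cluster(names):
    """Build fully connected toy servers (hypothetical helper)."""
    servers = [ReplicaServer(n) for n in names]
    for s in servers:
        s.peers = [p for p in servers if p is not s]
    return servers
```

With three servers from `make_cluster(["S0", "S1", "S2"])`, an open on S0 followed by an open on S1 returns S0 both times, and after S0 closes the file any server can become primary again.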
Directory updates
Prohibit concurrent updates
  A replication server waits for the primary to relinquish its role
Atomicity for updates that involve multiple objects (e.g. rename)
  A server must become primary for all objects
  Updates are grouped and processed together
Close-to-open semantics
Server becomes primary after it collects votes from a majority of replication servers
  Use a majority consensus algorithm
  Cost is dominated by the median RTT from the primary server to other replication servers
Primary server must ensure that every replication server has acknowledged its election when a written file is closed
  Guarantees close-to-open semantics
Heuristic: allow a new file to inherit the primary server that controls its parent directory for file creation
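A rough way to see why the median RTT dominates the election cost while the close waits on the slowest server. This sketch assumes, beyond what the slide states, that vote requests go out in parallel, that each vote costs one round trip, and that the candidate counts its own vote. The function names are hypothetical.

```python
def election_delay(peer_rtts_ms):
    """Time until a majority of the group (candidate included) has voted."""
    group_size = len(peer_rtts_ms) + 1
    votes_needed_from_peers = group_size // 2  # majority minus the candidate's own vote
    if votes_needed_from_peers == 0:
        return 0.0
    # Parallel requests: the k-th fastest peer decides when the majority is reached.
    return sorted(peer_rtts_ms)[votes_needed_from_peers - 1]


def close_delay(peer_rtts_ms):
    """Time until every peer has acknowledged the election (needed at close)."""
    return max(peer_rtts_ms, default=0.0)
```

For peers at 10, 20, 30, and 40 ms, the election completes after roughly 20 ms, while confirming that every server has acknowledged it takes about 40 ms.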
Durability guarantee
“Active view” update policy
  Every server keeps track of the liveness of other servers (active view)
  Primary server removes from its active view any server that fails to respond to its request
  Primary server distributes updates synchronously and in parallel
  Primary server acknowledges a client write after a majority of replication servers reply
  Primary sends other servers its active view with file close
A failed replication server must synchronize with the up-to-date copy before it can rejoin the active group
I suppose this is expensive
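The active view policy can be illustrated with a single update round. This is a minimal sketch under stated assumptions: `responds` is a stand-in for a network round trip, the primary counts its own copy toward the majority, and the function name and signature are hypothetical.

```python
def replicate_write(active_view, total_servers, responds):
    """Send one update to every server in the primary's active view.

    active_view:   set of peer names the primary believes are alive
    total_servers: size of the whole replication group, primary included
    responds:      callable(name) -> bool, stands in for a network round trip
    Returns (acknowledged, new_active_view).
    """
    # Servers that fail to respond are dropped from the active view.
    replied = {name for name in active_view if responds(name)}
    # Assumption: the primary's own copy counts toward the majority.
    replies = 1 + len(replied)
    acknowledged = replies > total_servers // 2
    return acknowledged, replied
```

In a group of five, a write is acknowledged to the client when at least two peers reply; with only one live peer the write is not acknowledged.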
What I skipped
Not the Right Stuff
  GridFTP: manual synchronization
  NFSv4.r: write-mostly WAN performance
  AFS, Coda, et al.: sharing semantics
Consistent replication for Grid computing
  Ordered writes too weak
  Strict consistency too strong
  Close-to-open just right
NFSv4.r in brief
View-based replication control protocol
  Based on (provably correct) El-Abbadi, Skeen, and Cristian
Dynamic election of primary server
  At the granularity of a single file or directory
Majority consensus on open (for synchronization)
Synchronous updates to a majority (for durability)
Total consensus on close (for close-to-open)
Write-mostly WAN performance
Durability overhead
  Synchronous updates
Synchronization overhead
  Consensus management
Asynchronous updates
Consensus requirement delays client updates
  Median RTT between the primary server and other replication servers is costly
  Synchronous write performance is worse
Solution: asynchronous update
  Let application decide whether to wait for server recovery or regenerate the computation results
  OK for Grid computations that checkpoint
  Revisit at end with new ideas
Hierarchical replication control
Synchronization is costly over WAN
Hierarchical replication control
  Amortizes consensus management
  A primary server can assert control at different granularities
Shallow & deep control
[Figure: two copies of a directory tree, /usr with children bin and local, illustrating shallow and deep control]
A server with a shallow control on a file or directory is the primary server for that single object
A server with a deep control on a directory is the primary server for everything in the subtree rooted at that directory
Primary server election
Allow deep control for a directory D if no descendant of D is controlled by another server
Grant a shallow control request for object L from peer server P if
  L is not controlled by a server other than P
Grant a deep control request for directory D from peer server P if
  D is not controlled by a server other than P
  No descendant of D is controlled by a server other than P
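The grant rules above can be written down as a small table of shallow and deep controls. This is a sketch rather than the NFSv4.r implementation: `ControlTable` and its methods are hypothetical names, objects are identified by slash-separated paths, and an object counts as "controlled" by a server that holds shallow control on it or deep control on it or one of its ancestors.

```python
def is_descendant(path, ancestor):
    """True if `path` lies strictly below directory `ancestor`."""
    return path != ancestor and path.startswith(ancestor.rstrip("/") + "/")


class ControlTable:
    """Tracks which server holds shallow or deep control (illustrative only)."""

    def __init__(self):
        self.shallow = {}  # path -> server with shallow control
        self.deep = {}     # directory -> server with deep control

    def _controllers(self, path):
        """Servers that currently control `path`, directly or via deep control."""
        owners = set()
        if path in self.shallow:
            owners.add(self.shallow[path])
        for directory, server in self.deep.items():
            if path == directory or is_descendant(path, directory):
                owners.add(server)
        return owners

    def request_shallow(self, path, server):
        # Rule: grant if the object is not controlled by a server other than the requester.
        if self._controllers(path) - {server}:
            return False
        self.shallow[path] = server
        return True

    def request_deep(self, directory, server):
        # Rule 1: the directory itself is not controlled by another server.
        if self._controllers(directory) - {server}:
            return False
        # Rule 2: no descendant is controlled by another server.
        for table in (self.shallow, self.deep):
            for path, owner in table.items():
                if owner != server and is_descendant(path, directory):
                    return False
        self.deep[directory] = server
        return True
```

If S0 holds `/a/b` and S1 holds `/a/c`, a deep control request for `/a` from S2 is refused, while a shallow control request for `/a` from S2 is granted.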
Ancestry table
[Figure: directory tree]
/root
  a
    c
      d1  (controlled by S0)
      f1  (controlled by S0)
    f2  (controlled by S1)
  b
    d2  (controlled by S2)

Ancestry table
Id      counter array
        S0  S1  S2
root    2   1   1
a       2   1   0
b       0   0   1
c       2   0   0

The data structure of entries in the ancestry table
Ancestry entry: an ancestry entry has the following attributes
  id = unique identifier of the directory
  array of counters = set of counters recording which servers control the directory’s descendants
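The counter arrays can be maintained by bumping one counter in every ancestor whenever a server gains control of an object. The sketch below is one plausible reading of the slide, not the actual NFSv4.r data structure: the class and method names are hypothetical, directories are keyed by path rather than by a unique identifier, and the example tree is the one implied by the counters in the table.

```python
from collections import defaultdict


def ancestors(path):
    """Ancestor directories of a slash-separated path, nearest first."""
    parts = path.split("/")
    return ["/".join(parts[:i]) for i in range(len(parts) - 1, 0, -1)]


class AncestryTable:
    """directory -> per-server count of controlled descendants (illustrative)."""

    def __init__(self):
        self.counters = defaultdict(lambda: defaultdict(int))

    def add_control(self, path, server):
        # A server gained control of `path`: record it in every ancestor entry.
        for directory in ancestors(path):
            self.counters[directory][server] += 1

    def release_control(self, path, server):
        # Control was released: undo the increments.
        for directory in ancestors(path):
            self.counters[directory][server] -= 1

    def row(self, directory, servers):
        """Counter array for one directory, in the given server order."""
        return [self.counters[directory][s] for s in servers]

    def descendants_free_for(self, directory, server):
        """True if no descendant of `directory` is controlled by another server."""
        return all(count == 0
                   for owner, count in self.counters[directory].items()
                   if owner != server)


table = AncestryTable()
for path, server in [("root/a/c/d1", "S0"), ("root/a/c/f1", "S0"),
                     ("root/a/f2", "S1"), ("root/b/d2", "S2")]:
    table.add_control(path, server)
```

With this table, checking whether a deep control request on a directory conflicts with a descendant is a single lookup in that directory's entry instead of a walk over the subtree.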
Primary election
S0 and S1 succeed in their primary server elections
S2’s election fails due to conflicts
  Solution: S2 then retries by asking for shallow control of a
[Figure: directory a with children b and c; S0 requests control of b, S1 requests control of c, and S2 requests deep control of a]
Performance vs. concurrency
Associate a timer with deep control
  Reset the timer with subsequent updates
  Release deep control when timer expires
  A small timer value captures bursty updates
Issue a separate shallow control for a file written under a deep controlled directory
  Still process the write request immediately
  Subsequent writes on the file do not reset the timer of the deep controlled directory
Performance vs. concurrency
Increase concurrency when the system consists of multiple writers
  Send a revoke request upon concurrent writes
  The primary server shortens its release timer
Optimally issues a deep control request for a directory that contains many updates in single writer cases
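The timer behavior on the two "Performance vs. concurrency" slides can be sketched with an explicit clock. The class name, the method names, and the idea of passing `now` as a parameter are assumptions made for illustration; the slides do not say how far the timer is shortened on a revoke, so that value is a parameter too.

```python
class DeepControlTimer:
    """Release timer for one deep-controlled directory (illustrative only)."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.expires_at = None  # None means deep control is not held

    def on_directory_update(self, now):
        # Each update under deep control restarts the timer.
        self.expires_at = now + self.timeout_s

    def on_revoke_request(self, now, shortened_s):
        # Another server wants to write: release sooner, never later.
        if self.expires_at is not None:
            self.expires_at = min(self.expires_at, now + shortened_s)

    def held(self, now):
        # Deep control is released once the timer expires.
        return self.expires_at is not None and now < self.expires_at
```

A burst of updates spaced less than `timeout_s` apart keeps deep control in place, so the whole burst is covered by a single election; writes to a file that has its own shallow control would simply not call `on_directory_update`.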
Single remote NFS
[Figure: SSH build time (s, log scale) vs. RTT between NFS server and client (0.2 to 40 ms), broken down into unpack, configure, and build]
Deep vs. shallow
[Figure: SSH build time (s) vs. RTT between two replication servers (0.2 to 120 ms), with a single local server for comparison; broken down into unpack, configure, and build; shallow controls vs. deep + shallow controls]
Deep control timer
[Figure: SSH build time (s) vs. RTT between two replication servers (0.2 to 120 ms) for 0.1s, 0.5s, and 1s timers, with a single local server for comparison; broken down into unpack, configure, and build]
Durability revisited
Synchronization is expensive, but …
When we abandon the durability guarantee, we risk losing the results of the computation
  And may be forced to rerun it
  But it might be worth it
Goal: maximize utilization
Utilization tradeoffs
Adding synchronous replication servers enhances durability
  Which reduces the risk that results are lost
  And that the computation must be restarted
  Which benefits utilization
But increases run time
  Which reduces utilization
Placement tradeoffs
Nearby replication servers reduce the replication penalty
  Which benefits utilization
Nearby replication servers are more vulnerable to correlated failure
  Which reduces utilization
Run-time model
[Figure: state diagram with states start, run, recover, and end; transitions labeled ok and fail]
Parameters
F: failure free, single server run time
C: replication overhead
R: recovery time
pfail: server failure
precover: successful recovery
F: run time
Failure-free, single server run time
Can be estimated or measured
Our focus is on 1 to 10 days
C: replication overhead
Penalty associated with replication to backup servers
Proportional to RTT
Ratio can be measured by running with a backup server a few msec away
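Since C is taken to be proportional to RTT, one measurement with a nearby backup server is enough to extrapolate. The helper below is a hypothetical illustration of that scaling, not a formula from the slides, and the numbers in the usage note are made up.

```python
def replication_overhead(target_rtt_ms, measured_overhead, measured_rtt_ms):
    """Scale a measured replication overhead linearly to another RTT.

    measured_overhead is the extra run time observed with a backup server
    measured_rtt_ms away, in the same unit as F.
    """
    return measured_overhead * (target_rtt_ms / measured_rtt_ms)
```

For example, if a backup server 2 ms away adds 0.5 hours to a run, the same job with a backup server 40 ms away would be estimated to incur 10 hours of overhead under this linear assumption.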
R: recovery time
Time to detect failure of the primary server and switch to a backup server
We assume R << F
  Arbitrary realistic value: 10 minutes
Failure distributions
Estimated by analyzing PlanetLab ping data
  716 nodes, 349 sites, 25 countries
  All-pairs, 15 minute interval
  From January 2004 to June 2005
    692 nodes were alive throughout
We ascribe missing pings to node failure and network partition
PlanetLab failure CDF
Same-site correlated failures
         sites
nodes    259     65      21      11
2        0.526   0.593   0.552   0.561
3                0.546   0.440   0.538
4                        0.378   0.488
5                                0.488
Different-site correlated failures
[Figure: average failure correlations vs. maximum RTT (0 to 200 ms) for groups of 2, 3, 4, and 5 nodes. Linear fits shown on the plot: y = -0.0002x + 0.1955; y = -0.0002x + 0.155; y = -0.0002x + 0.1335; y = -0.0002x + 0.1186]
Run-time model
Discrete event simulation yields expected run time E and utilization (F ÷ E)
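A toy Monte Carlo version of the run-time model shows how E and the utilization F ÷ E fall out of the parameters. This is not the simulator behind the results on the following slides. It assumes, beyond what the slides state, that each run segment fails with probability p_fail at a uniformly random point, that a successful recovery resumes where the job stopped, and that a failed recovery restarts the job from the beginning. The function names are hypothetical.

```python
import random


def simulate_run(F, C, R, p_fail, p_recover, rng):
    """Return the elapsed time of one job under the simplified model."""
    total = F + C          # failure-free run time plus replication overhead
    remaining = total
    elapsed = 0.0
    while True:
        if rng.random() >= p_fail:
            return elapsed + remaining        # ok: run to the end
        worked = rng.uniform(0.0, remaining)  # assumption: uniform failure point
        elapsed += worked + R                 # time lost detecting and switching
        if rng.random() < p_recover:
            remaining -= worked               # backup takes over, work is kept
        else:
            remaining = total                 # results lost, start over


def utilization(F, C, R, p_fail, p_recover, runs=10000, seed=0):
    """Estimate F / E, where E is the mean simulated run time."""
    rng = random.Random(seed)
    expected = sum(simulate_run(F, C, R, p_fail, p_recover, rng)
                   for _ in range(runs)) / runs
    return F / expected
```

With no failures the utilization is exactly F / (F + C); any failure adds at least R to the run, so the estimate can only drop from there.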
Simulated utilization, F = one hour
[Figure: simulated utilization with one backup server and with four backup servers; curves for C=0.1F, C=0.01F, C=0.001F, C=0.0001F, and no backup]
Simulation results, F = one day
[Figure: simulated utilization with one backup server and with four backup servers; curves for C=0.1F, C=0.01F, C=0.001F, C=0.0001F, and no backup]
Simulation results, F = ten days
[Figure: simulated utilization with one backup server and with four backup servers]
Simulation results discussion
For long-running jobs
  Replication improves utilization
  Distant servers improve utilization
For short jobs
  Replication does not improve utilization
In general, multiple backup servers don’t help much
Implications for checkpoint interval …
Checkpoint interval
[Figure: F = one day, one backup server, 20% checkpoint overhead]
[Figure: F = ten days, 2% checkpoint overhead; one backup server and four backup servers]
Next steps
Checkpoint overhead?
Replication overhead?
  Depends on amount of computation
  We measure < 10% for NAS Grid Benchmarks, which do no computation
Refine model
  Account for other failures
    Because they are common
  Other model improvements
Conclusions
Conventional wisdom holds that consistent mutable replication in large-scale distributed systems is too expensive to consider
Our study proves otherwise
Conclusions
Consistent replication in large-scale distributed storage systems is feasible and practical
  Superior performance
  Rigorous adherence to conventional file system semantics
  Improves cluster utilization
Thank you for your attention!
www.citi.umich.edu
Questions?