Distributed File Systems
CS 519: Operating System Theory
Computer Science, Rutgers University
Instructor: Thu D. Nguyen
TA: Xiaoyan Li
Spring 2002
Computer Science, Rutgers CS 519: Operating System Theory
File Service
Implemented by a user/kernel process called file server
A system may have one or several file servers running at the same time
Two models for file service:
upload/download: files move between server and clients; few operations (read file & write file); simple; requires storage at the client; good if the whole file is accessed
remote access: files stay at the server; rich interface with many operations; less space needed at the client; efficient for small accesses
Directory Service
Provides naming usually within a hierarchical file system
Clients can have the same view (global root directory) or different views of the file system (remote mounting)
Location transparent: location of the file doesn’t appear in the name of the file
ex: /server1/dir1/file specifies the server but not where the server is located -> server can move the file in the network without changing the path
Location independence: a single name space that looks the same on all machines, files can be moved between servers without changing their names -> difficult
Two-Level Naming
Symbolic name (external), e.g. prog.c; binary name (internal), e.g. local i-node number as in Unix
Directories provide the translation from symbolic to binary names
Binary name formats:
i-node: no cross-references among servers
(server, i-node): a directory in one server can refer to a file on a different server
Capability specifying address of server, number of file, access permissions, etc
{binary_name+}: binary names refer to the original file and all of its backups
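As a minimal sketch of two-level naming, the directory below maps symbolic (external) names to binary (internal) names; the (server, i-node) pair shown is one of the formats listed above, and the specific server name and i-node number are illustrative.

```python
class Directory:
    """Translates symbolic (external) names to binary (internal) names."""

    def __init__(self):
        self._entries = {}  # symbolic name -> binary name

    def enter(self, symbolic, binary):
        self._entries[symbolic] = binary

    def lookup(self, symbolic):
        return self._entries[symbolic]


d = Directory()
# (server, i-node) format: a directory can refer to a file on another server
d.enter("prog.c", ("server1", 4711))
assert d.lookup("prog.c") == ("server1", 4711)
```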
File Sharing Semantics
UNIX semantics: total ordering of R/W events; easy to achieve in a non-distributed system
in a distributed system with one server and multiple clients with no caching at client, total ordering is also easily achieved since R and W are immediately performed at server
Session semantics: writes are guaranteed to become visible only when the file is closed
allow caching at client with lazy updating -> better performance
if two or more clients write simultaneously, one file replaces the other (last writer wins, or the outcome is non-deterministic)
File Sharing Semantics (cont’d)
Immutable files: create and read file operations (no write)
writing a file means creating a new file and entering it into the directory, replacing the previous file with the same name: an atomic operation
collision in writing: the last copy wins, or the outcome is non-deterministic
what happens if the old copy is being read?
Transaction semantics: mutual exclusion on file accesses; either all file operations complete or none do. Good for banking systems
File System Properties
Observed in a study by Satyanarayanan (1981):
most files are small (< 10K)
reading is much more frequent than writing
most R&W accesses are sequential (random access is rare)
most files have a short lifetime -> create the file on the client
file sharing is unusual -> caching at client
the average process uses only a few files
Server System Structure
File + directory service: combined or not
Cache directory hints at client to accelerate the path name look up – directory and hints must be kept coherent
State information about clients at the server:
stateless server: no client information is kept between requests
stateful server: servers maintain state information about clients between requests
Stateless vs. Stateful
Stateless server:
requests are self-contained (every access may need name translation)
better fault tolerance
open/close done at the client (fewer messages)
no space reserved for file descriptor tables, thus no limit on the number of open files
no problem if a client crashes

Stateful server:
shorter messages
better performance (info kept in memory until close)
open/close at the server
file locking possible
read-ahead possible
Caching
Three possible places: server’s memory, client’s disk, client’s memory
Caching in server’s memory: avoids disk access but still network access
Caching at client’s disk (if available): tradeoff between disk access and remote memory access
Caching at client, usually in main memory:
inside each process address space: no sharing at client
in the kernel: kernel involvement on hits
in a separate user-level cache manager: flexible and efficient if paging can be controlled from user-level
Server-side caching eliminates coherence problem. Client-side cache coherence? Next…
Client Cache Coherence in DFS
How to maintain coherence (according to a model, e.g. UNIX semantics or session semantics) of copies of the same file at various clients
Write-through: writes are sent to the server as soon as they are performed at the client -> high traffic; requires cache managers to check the modification time with the server before they can provide cached content to any client
Delayed write: coalesces multiple writes; better performance but ambiguous semantics
Write-on-close: implements session semantics
Central control: file server keeps a directory of open/cached files at clients -> Unix semantics, but problems with robustness and scalability; problem also with invalidation messages because clients did not solicit them
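The write-through scheme above can be sketched as follows: before serving cached data, the cache manager compares the server's current modification time against the one recorded when the data was cached. The `Server` and `WriteThroughCache` classes and their method names are illustrative, not an actual DFS API.

```python
class Server:
    """Toy file server: name -> (mtime, data)."""

    def __init__(self):
        self.files = {}

    def write(self, name, data, mtime):
        self.files[name] = (mtime, data)

    def mtime(self, name):
        return self.files[name][0]

    def read(self, name):
        return self.files[name]


class WriteThroughCache:
    """Client cache that validates entries against the server's mtime."""

    def __init__(self, server):
        self.server = server
        self.cache = {}  # name -> (mtime, data)

    def read(self, name):
        if name in self.cache:
            mtime, data = self.cache[name]
            if self.server.mtime(name) == mtime:  # still fresh?
                return data
        mtime, data = self.server.read(name)      # (re)fetch from server
        self.cache[name] = (mtime, data)
        return data

    def write(self, name, data, mtime):
        self.cache[name] = (mtime, data)
        self.server.write(name, data, mtime)      # sent to server immediately


srv = Server()
srv.write("f", b"v1", mtime=1)
c1, c2 = WriteThroughCache(srv), WriteThroughCache(srv)
assert c1.read("f") == b"v1"       # cached at c1
c2.write("f", b"v2", mtime=2)      # write-through: server updated now
assert c1.read("f") == b"v2"       # c1's mtime check detects the change
```

The mtime check on every read is exactly the extra traffic the slide warns about.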
File Replication
Multiple copies are maintained, each copy on a separate file server - multiple reasons:
Increase reliability: file accessible even if a server is down
Improve scalability: reduce the contention by splitting the workload over multiple servers
Replication transparency:
explicit file replication: the programmer controls replication
lazy file replication: copies made by the server in background
use group communication: all copies made at the same time in the foreground
How replicas should be modified? Next…
Modifying Replicas: Voting Protocol
Updating all replicas through a coordinator works but is not robust (if the coordinator is down, no updates can be performed) => Voting: updates (and reads) can be performed if some specified number of servers agree.
Voting protocol:
A version # (incremented on each write) is associated with each file
To perform a read, a client has to assemble a read quorum of Nr servers; similarly, a write quorum of Nw servers for a write
If Nr + Nw > N, then any read quorum will contain at least one most recently updated file version
For reading, client contacts Nr active servers and chooses the file with largest version #
For writing, client contacts Nw active servers asking them to write. Succeeds if they all say yes.
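A minimal sketch of the voting rule: with Nr + Nw > N, every read quorum overlaps every write quorum, so the highest version number seen in a read quorum identifies the latest write. The quorum sizes and data values below are illustrative.

```python
N, NR, NW = 5, 2, 4           # Nr + Nw > N, so quorums must overlap
assert NR + NW > N

servers = [{"version": 0, "data": None} for _ in range(N)]


def write(quorum, version, data):
    quorum = list(quorum)
    assert len(quorum) >= NW, "no write quorum"
    for s in quorum:
        servers[s]["version"] = version
        servers[s]["data"] = data


def read(quorum):
    assert len(quorum) >= NR, "no read quorum"
    # choose the file with the largest version number in the quorum
    best = max((servers[s] for s in quorum), key=lambda r: r["version"])
    return best["data"]


write(range(0, 4), version=1, data="v1")  # write quorum: servers 0-3
assert read([3, 4]) == "v1"               # overlap at server 3 finds "v1"
```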
Modifying Replicas: Voting Protocol (cont'd)
Nr is usually small (reads are frequent), but Nw is usually close to N (want to make sure all replicas are updated). Problem with achieving a write quorum in the presence of server failures
Voting with ghosts: allows a write quorum to be established when several servers are down, by temporarily creating dummy (ghost) servers (at least one must be real)
Ghost servers are not permitted in a read quorum (they don’t have any files)
When server comes back it must restore its copy first by obtaining a read quorum
Network File System (NFS)
A stateless DFS implemented at Sun
An NFS server exports directories
Clients access exported directories by mounting them
Because NFS is stateless, OPEN and CLOSE operations are not needed at the server (they are implemented at the client)
NFS provides file locking but UNIX file semantics is not achieved because of client caching
Write-through protocol, but delay is possible: dirty cache blocks are sent back by clients in chunks, every 30 sec or on close
a timer is associated with each cache block at the client (3 sec for data blocks, 30 sec for directory blocks). When the timer expires, the entry is discarded (if clean, of course)
when a file is opened, the last modification time at the server is checked
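The timer scheme above can be sketched as follows: each cached entry records when it was cached, and entries older than the per-type timeout are discarded before being served. The timeouts follow the slide (3 s for data blocks, 30 s for directory blocks); the class and its interface are illustrative, not the actual NFS client code.

```python
import time

TIMEOUT = {"data": 3.0, "directory": 30.0}  # seconds, per the slide


class NFSClientCache:
    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries = {}  # key -> (kind, cached_at, content)

    def put(self, key, kind, content):
        self.entries[key] = (kind, self.clock(), content)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        kind, cached_at, content = entry
        if self.clock() - cached_at > TIMEOUT[kind]:
            del self.entries[key]  # timer expired: discard the entry
            return None
        return content


# Usage with a fake clock so expiry is deterministic:
now = [0.0]
cache = NFSClientCache(clock=lambda: now[0])
cache.put("f/block0", "data", b"hello")
assert cache.get("f/block0") == b"hello"
now[0] += 4.0                             # past the 3 s data-block timeout
assert cache.get("f/block0") is None      # entry was discarded
```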
Recent Research in DFS
Petal & Frangipani (DEC SRC): 2-layer DFS system
Petal: Distributed Virtual Disks
A distributed storage system that provides a virtual disk abstraction separate from the physical resource
The virtual disk is globally accessible to all Petal clients on the network
Virtual disks are implemented on a cluster of servers that cooperate to manage a pool of physical disks
Advantages
recover from any single failure
transparent reconfiguration and expandability
load and capacity balancing
low-level service (lower than a DFS) that handles distribution problems
Virtual to Physical Translation
<virtual disk, virtual offset> -> <server, physical disk, physical offset>
Three data structures: virtual disk directory, global map, and physical map
The virtual disk directory and global map are globally replicated and kept consistent
Physical map is local to each server
One level of indirection (virtual disk to global map) is necessary to allow transparent reconfiguration. We’ll discuss reconfiguration soon
Virtual to Physical Translation (cont’d)
The virtual disk directory translates the virtual disk identifier (like volume id) into a global map identifier
The global map determines the server responsible for translating the given offset (a virtual disk may be spread over multiple physical disks). The global map also specifies the redundancy scheme for the virtual disk
The physical map at specific server translates global map identifier and the offset to a physical disk and an offset within that disk. Physical map is similar to a page table
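The three-level translation can be sketched with plain dictionaries; all identifiers, servers, and offsets below are illustrative, and in Petal each physical map is local to the server that owns the disks.

```python
# Level 1: virtual disk directory (globally replicated)
vdisk_directory = {"vdisk0": "gmap0"}  # virtual disk id -> global map id

# Level 2: global map (globally replicated): offset range -> responsible server
global_maps = {"gmap0": [("srvA", 0, 1000), ("srvB", 1000, 2000)]}

# Level 3: per-server physical maps (local, like a page table):
# (global map id, virtual offset) -> (physical disk, physical offset)
physical_maps = {
    "srvA": {("gmap0", 42): ("disk3", 9042)},
    "srvB": {("gmap0", 1042): ("disk0", 77)},
}


def translate(vdisk, voffset):
    gmap_id = vdisk_directory[vdisk]               # level 1
    for server, lo, hi in global_maps[gmap_id]:    # level 2: find the server
        if lo <= voffset < hi:
            disk, poffset = physical_maps[server][(gmap_id, voffset)]
            return server, disk, poffset           # level 3
    raise KeyError("offset not mapped")


assert translate("vdisk0", 42) == ("srvA", "disk3", 9042)
```

The extra indirection through the global map is what lets reconfiguration swap in a new map without touching the virtual disk directory entries one by one.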
Support for Backup
Petal simplifies a client’s backup procedure by providing a snapshot mechanism
Petal generates snapshots of virtual disks using copy-on-write (backup files are pointing to old blocks with write protection). Creating a snapshot requires pausing the client’s application to guarantee consistency
A snapshot is a virtual disk that cannot be modified
Snapshots require a modification to the translation scheme. The virtual disk directory translates a virtual disk id into a pair <global map id, epoch #> where epoch # is incremented at each snapshot
At each snapshot a new tuple with a new epoch is created in the virtual disk directory. The snapshot takes the old epoch #
All accesses to the virtual disk are made using the new epoch #, so that any write to the original disk create new entries in the new epoch rather than overwrite the blocks in the snapshot
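The epoch mechanism can be sketched as copy-on-write keyed by (epoch, offset): writes go to the current epoch, and reads fall back through older epochs, so a snapshot shares all unmodified blocks with the live disk. The class below is a toy model, not Petal's actual data structures.

```python
class VirtualDisk:
    def __init__(self):
        self.epoch = 0
        self.blocks = {}  # (epoch #, offset) -> data

    def snapshot(self):
        old = self.epoch
        self.epoch += 1   # subsequent writes create entries in the new epoch
        return old        # the snapshot keeps the old epoch #

    def write(self, offset, data):
        self.blocks[(self.epoch, offset)] = data

    def read(self, offset, epoch=None):
        e = self.epoch if epoch is None else epoch
        while e >= 0:                         # fall back through older epochs
            if (e, offset) in self.blocks:
                return self.blocks[(e, offset)]
            e -= 1
        return None


vd = VirtualDisk()
vd.write(0, "v1")
snap = vd.snapshot()
vd.write(0, "v2")                  # new entry; snapshot's block untouched
assert vd.read(0) == "v2"          # live disk sees the new data
assert vd.read(0, epoch=snap) == "v1"  # snapshot is immutable
```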
Virtual Disk Reconfiguration
Needed when a new server is added or the redundancy scheme is changed
Steps to perform it at once (not incrementally) and in the absence of any other activity:
create a new global map with desired redundancy scheme and server mapping
change all virtual disk directories to point to the new global map
redistribute data to the servers according to the translation specified in the new global map
The challenge is to perform it incrementally and concurrently with normal client requests
Incremental Reconfiguration
First two steps as before; step 3 done in background starting with the translations in the most recent epoch that have not yet been moved
Old global map is used to perform read translations which are not found in the new global map
A write request only accesses the new global map to avoid consistency problems
Limitation: the mapping of the entire virtual disk must be changed before any data is moved -> lots of new global map misses on reads -> high traffic. Solution: relocate only a portion of the virtual disk at a time. Read requests for portion of virtual disk being relocated cause misses, but not requests to other areas
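The read and write paths during incremental reconfiguration can be sketched as follows: reads try the new global map and fall back to the old one on a miss, while writes only access the new map. The maps are plain dicts here purely for illustration.

```python
old_map = {0: "srvA", 1: "srvA", 2: "srvB"}  # pre-reconfiguration layout
new_map = {}                                  # filled by background relocation


def read_translate(offset):
    if offset in new_map:        # block already relocated
        return new_map[offset]
    return old_map[offset]       # miss in new map: fall back to the old map


def write_translate(offset, server):
    new_map[offset] = server     # writes only access the new global map
    return server


assert read_translate(1) == "srvA"   # not yet moved -> served via old map
write_translate(1, "srvC")           # background move (or a client write)
assert read_translate(1) == "srvC"   # now served from the new map
```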
Redundancy with Chained Data Placement
Petal uses chained-declustering data placement
two copies of each data block are stored on neighboring servers
every pair of neighboring servers has data blocks in common
if server 1 fails, servers 0 and 2 will share server 1's read load (not server 3)
server 0    server 1    server 2    server 3
  d0          d1          d2          d3      (primaries)
  d3          d0          d1          d2      (secondaries)
  d4          d5          d6          d7      (primaries)
  d7          d4          d5          d6      (secondaries)
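The placement rule behind the 4-server layout above can be sketched in a few lines: block i's primary lives on server i mod N and its secondary on the next server in the chain.

```python
def placement(block, nservers=4):
    """Chained declustering: primary on block % N, copy on the next server."""
    primary = block % nservers
    secondary = (primary + 1) % nservers
    return primary, secondary


assert placement(0) == (0, 1)   # d0: primary on server 0, copy on server 1
assert placement(3) == (3, 0)   # d3: the chain wraps around to server 0
# If server 1 fails, its primaries (d1, d5, ...) are read from server 2's
# copies, while its copies (d0, d4, ...) are served by their primaries on
# server 0 -- so the extra load splits over both neighbors.
```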
Chained Data Placement (cont’d)
In case of failure, each server can offload some of its original read load to the next/previous server. Offloading can be cascaded across servers to uniformly balance load
Advantage: with simple mirrored redundancy, the failure of a server would result in a 100% load increase on its mirror; chained declustering spreads the extra load across servers
Disadvantage: less reliable than simple mirroring - if a server fails, the failure of either one of its two neighbor servers will result in data becoming unavailable
In Petal, one copy is called primary, the other secondary
Read requests can be serviced by any of the two servers, while write requests must always try the primary first to prevent deadlock (blocks are locked before reading or writing, but writes require access to both servers)
Read Request
The Petal client tries primary or secondary server depending on which one has the shorter queue length. (Each client maintains a small amount of high-level mapping information that is used to route requests to the “most appropriate” servers. If a request is sent to an inappropriate server, the server returns an error code, causing the client to update its hints and retry the request)
The server that receives the request attempts to read the requested data
If not successful, the client tries the other server
Write Request
The Petal client tries the primary server first
The primary server marks data busy and sends the request to its local copy and the secondary copy
When both complete, the busy bit is cleared and the operation is acknowledged to the client
If not successful, the client tries the secondary server
If the secondary server detects that the primary server is down, it marks the data element as stale on stable storage before writing to its local disk
When the primary server comes up, the primary server has to bring all data marked stale up-to-date during recovery
Similar if secondary server is down
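The write path above can be sketched as follows (the busy bit is elided, and the `Replica` class and its fields are illustrative): the primary writes locally and forwards to the secondary; if the primary is down, the secondary marks the element stale on stable storage before writing, queuing up recovery work for the primary.

```python
class Replica:
    def __init__(self):
        self.up = True
        self.store = {}      # data element -> value
        self.stale = set()   # elements the failed peer must refresh


def write(data, key, primary, secondary):
    if primary.up:
        primary.store[key] = data    # primary writes its local copy ...
        secondary.store[key] = data  # ... and forwards to the secondary
        return "ok"
    # Primary down: mark the element stale first, then write locally.
    secondary.stale.add(key)
    secondary.store[key] = data
    return "ok (primary marked stale)"


p, s = Replica(), Replica()
write("v1", "blk0", p, s)
assert p.store["blk0"] == s.store["blk0"] == "v1"

p.up = False
write("v2", "blk0", p, s)
assert s.store["blk0"] == "v2"
assert "blk0" in s.stale    # recovery: primary must refresh this on restart
```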
Petal Performance - Latency
Single client generates requests to random disk offsets
Petal Performance - Throughput
Each of 4 clients making random requests to a single virtual disk. Failed configuration = one of the 4 servers has crashed
Frangipani
Petal provides disk interface -> need a file system
Frangipani is a file system designed to take full advantage of Petal
Frangipani’s main characteristics:
All users are given a consistent view of the same set of files
Servers can be added without changing configuration of existing servers or interrupting their operation
Tolerates and recovers from machine, network, and disk failures
Very simple internally: a set of cooperating machines that use a common store and synchronize access to that store with locks
Frangipani
Petal takes much of the complexity out of Frangipani
Petal provides highly available storage that can scale in throughput and capacity
However, Frangipani improves on Petal, since:
Petal has no provision for sharing the storage among multiple clients
Applications use a file-based interface rather than the disk-like interface provided by Petal
Problems with Frangipani on top of Petal:
Some logging occurs twice (once in Frangipani and once in Petal)
Cannot use disk location in placing data, because Petal virtualizes disks
Frangipani locks entire files and directories as opposed to individual blocks
Frangipani: Disk Layout
A Frangipani file system uses only 1 Petal virtual disk
Petal provides 2^64 bytes of “virtual” disk space
Commits real disk space when actually used (written)
Frangipani breaks the disk into regions:
1st region stores configuration parameters and housekeeping info
2nd region stores logs – each Frangipani server uses a portion of this region for its log. Can have up to 256 logs.
3rd region holds allocation bitmaps, describing which blocks in the remaining regions are free. Each server locks a different portion.
4th region holds inodes
5th region holds small data blocks (4 Kbytes each)
Remainder of Petal disk holds large data blocks (1 Tbyte each)
Frangipani: File Structure
First 16 blocks (64 KB) of a file are stored in small blocks
If file becomes larger, store the rest in a 1 TB large block
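The layout rule above maps a byte offset to a block in two cases: offsets in the first 64 KB (16 x 4 KB) land in small blocks, and everything beyond falls in the file's single large block. A minimal sketch:

```python
SMALL = 4 * 1024        # 4 KB small blocks
SMALL_BLOCKS = 16       # first 16 blocks cover the first 64 KB


def locate(offset):
    """Map a file byte offset to (block kind, block index, offset in block)."""
    if offset < SMALL_BLOCKS * SMALL:
        return ("small", offset // SMALL, offset % SMALL)
    base = SMALL_BLOCKS * SMALL
    return ("large", 0, offset - base)  # one 1 TB large block holds the rest


assert locate(0) == ("small", 0, 0)
assert locate(65_535) == ("small", 15, 4095)  # last byte in the small blocks
assert locate(65_536) == ("large", 0, 0)      # first byte of the large block
```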
Frangipani: Dealing with Failures
Write-ahead redo logging of metadata; user data is not logged (but Petal takes care of that).
Each Frangipani server has its own private log
Only after a log record is written to Petal does the server modify the actual metadata in its permanent locations
If a server crashes, the system detects the failure and another server uses the log to recover
Because the log is on Petal, any server can get to it.
Frangipani: Synchronization & Coherence
Frangipani has a lock for each log segment, each allocation bitmap segment, and each file
Multiple-reader/single-writer locks. In case of conflicting requests, the owner of the lock is asked to release or downgrade it to remove the conflict
A read lock allows a server to read data from disk and cache it. If server is asked to release its read lock, it must invalidate the cache entry before complying
A write lock allows a server to read or write data and cache it. If a server is asked to release its write lock, it must write dirty data to disk and invalidate the cache entry before complying. If a server is asked to downgrade the lock, it must write dirty data to disk before complying
Frangipani: Lock Service
Fully distributed lock service for fault tolerance and scalability
How to release locks owned by a failed Frangipani server?
The failure of a server is discovered when its “lease” expires. A lease is obtained by the server when it first contacts the lock service. All locks acquired are associated with the lease. Each lease has an expiration time (30 seconds) after its creation or last renewal. A server must renew its lease before it expires
When a server fails, the locks that it owns cannot be released until its log is processed and any pending updates are written to Petal
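The lease mechanism can be sketched as a timestamp that must be pushed forward before it expires; once a server's lease has lapsed (and its log has been processed), its locks become reclaimable. The 30-second term follows the slide; the class and its methods are illustrative.

```python
LEASE_TERM = 30.0  # seconds, per the slide


class Lease:
    """A lease obtained when a server first contacts the lock service."""

    def __init__(self, now):
        self.expires = now + LEASE_TERM

    def renew(self, now):
        self.expires = now + LEASE_TERM  # must be called before expiry

    def expired(self, now):
        return now > self.expires        # lapsed lease => server presumed dead


lease = Lease(now=0.0)
assert not lease.expired(now=20.0)
lease.renew(now=20.0)                 # renewed in time; expiry moves to 50.0
assert not lease.expired(now=45.0)
assert lease.expired(now=55.0)        # failure detected: process the server's
                                      # log, then release its locks
```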