Distributed Systems
Lecture 22: Distributed File Systems
Theophilus Benson, CS1380 Spring 20
Today's Agenda
• General Distributed File Systems
• Industry Use Cases
  • Google File System (GFS)
• Next Class
  • MongoDB (Guest lecture)
  • Kafka (LinkedIn's Queue Processing)
What is a File?
• A blob of binary?
• A set of blobs? Think of a book: Table of Contents + chapters
  • Index -> inode (maps ranges to data blocks)
  • Chapters -> data blocks
• How about directories? How about file permissions?
[Figure: File1 is an index over data blocks Data1 and Data2; each data block is a run of raw bits.]
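Below is a minimal sketch of the inode idea in Python; the names (Inode, BLOCK_SIZE) are illustrative, not a real file system's API.

# Sketch: a file as an index (inode) over fixed-size data blocks.
# Purely illustrative; real inodes also use indirect blocks, etc.
BLOCK_SIZE = 4096

class Inode:
    def __init__(self):
        self.blocks = []            # block index -> data block (bytearray)

    def write(self, offset, data):
        # Grow the index as needed, then copy data byte by byte.
        end = offset + len(data)
        while len(self.blocks) * BLOCK_SIZE < end:
            self.blocks.append(bytearray(BLOCK_SIZE))
        for i, b in enumerate(data):
            blk, off = divmod(offset + i, BLOCK_SIZE)
            self.blocks[blk][off] = b

    def read(self, offset, length):
        return bytes(
            self.blocks[(offset + i) // BLOCK_SIZE][(offset + i) % BLOCK_SIZE]
            for i in range(length)
        )

f = Inode()
f.write(0, b"hello world")
assert f.read(0, 5) == b"hello"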
What is a Directory?
• Directory -> maps names to file IDs
• A directory can also contain directories
• In Linux, a directory is also a file
Example:
• Root Directory: File1 -> ID X, File2 -> ID Y, Dir1 -> ID Z
• Dir1: File3 -> ID M, File4 -> ID C
[Figure: directory entries resolve to files (e.g., File1, File5), each indexing its own data blocks.]
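Continuing the sketch above, a directory is just a mapping from names to file IDs and can itself be stored as a file; the names here are illustrative.

# Sketch: a directory maps names to file IDs; directories can nest.
# As in Linux, the directory itself could be serialized into a file.
files = {"X": b"contents of File1", "M": b"contents of File3"}
root = {"File1": "X", "File2": "Y", "Dir1": {"File3": "M", "File4": "C"}}

def lookup(directory, path):
    """Resolve a '/'-separated path to a file ID."""
    entry = directory
    for name in path.strip("/").split("/"):
        entry = entry[name]          # descend one level per path component
    return entry

assert lookup(root, "Dir1/File3") == "M"
assert files[lookup(root, "File1")] == b"contents of File1"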
What is a File System?
• File system -> a system that manages files
• Provides:
  • An API for applications to interact with files
  • Algorithms for securing files (access control)
  • Maintenance of metadata about each file
[Figure: applications access files File1 … FileN through the file system's API.]
File Metadata:
• File length (size)
• Timestamp
• Location
• Reference count
• Type
• Access control
• Owner
Some of these fields are modifiable by the application; others only by the file system itself.
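As a sketch, the metadata record might look like the following; the exact split between app-modifiable and file-system-maintained fields is illustrative.

from dataclasses import dataclass, field

@dataclass
class FileMetadata:
    # Maintained by the file system:
    length: int = 0             # file size in bytes
    timestamp: float = 0.0      # last-modified time
    location: str = ""          # where the data blocks live
    ref_count: int = 1          # how many directory entries point here
    file_type: str = "regular"
    # Modifiable by the application (via the API):
    access_control: dict = field(default_factory=dict)  # e.g., {"alice": "rw"}
    owner: str = ""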
Distributed File Systems (DFS)
Local Versus Distributed File System
• Failure implications:
  • Local: all components are down together
  • Distributed: only some components are down; others keep operating
• Performance implications:
  • Local: interactions are function calls -> very fast
  • Distributed: interactions are RPC calls -> variable speed
[Figure: local setup: a client directly attached to its storage. Distributed setup: a client reaching several storage servers over links with 50-100 ms latencies.]
Semantics
RPC call semantics:
• At-least-once (the call executes 1 or more times)
• At-most-once (the call executes 0 or 1 times)
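A minimal sketch of the difference: a server can turn at-least-once retries into at-most-once execution by remembering replies keyed by a request ID (handle and the IDs are illustrative names).

# Sketch: at-least-once = client retries; at-most-once = server dedupes.
executed = {}                        # request ID -> cached reply

def handle(request_id, operation):
    if request_id in executed:       # duplicate retry: replay cached reply,
        return executed[request_id]  # do NOT re-execute the operation
    reply = operation()
    executed[request_id] = reply
    return reply

counter = 0
def increment():
    global counter
    counter += 1
    return counter

# The client may send the same request twice (at-least-once)...
handle("req-1", increment)
handle("req-1", increment)           # retry of the same request
assert counter == 1                  # ...but it executes at most once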
Transparency Properties of a Distributed File System (DFS)
• Client Program/API
  • Access -> same API for remote and local files
  • Location -> same "name" for remote and local files
  • Mobility -> the client should be unaware of files moving
• System-level Performance
  • Performance -> as the workload grows, performance stays acceptable
  • Scalability -> as the number of files grows, performance stays acceptable
Performance Optimizations
• Caching: client versus server side
  • Client-side: minimizes load on the server and improves read latency
  • Server-side: improves performance
Server-side Caching: Write Issues
• Write-through caching
  • On every write: write to memory -> disk -> report OK
  • All writes persist to disk before the ack, which gives poor performance but good consistency
Server-side Caching: Write Issues
• Commits
  • Commit: on file close, commit/flush all writes to disk
  • Writes stay in memory until the commit -> good performance but consistency issues
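A sketch contrasting the two server-side write policies; the Server class and its dict-backed "disk" are illustrative stand-ins.

# Sketch: write-through vs. commit-on-close, server side.
class Server:
    def __init__(self):
        self.memory = {}    # fast cache: block number -> data
        self.disk = {}      # durable store

    def write_through(self, block, data):
        self.memory[block] = data
        self.disk[block] = data      # persist BEFORE acking: slow but safe
        return "OK"                  # the ack implies durability

    def write_cached(self, block, data):
        self.memory[block] = data    # ack immediately: fast but volatile
        return "OK"                  # a crash before commit loses this write

    def commit(self):                # called on file close
        self.disk.update(self.memory)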
Performance Optimizations
• Caching: client versus server side
  • Client-side: minimizes load on the server and improves read latency
  • Server-side: improves performance
• Server side:
  • Write caching: potential consistency issues
    • Commit: on file close, flush all writes to disk; writes stay in memory until commit -> good performance but consistency issues
    • Write-through: on every write, write to memory -> disk -> report OK; all writes persist to disk before the ack -> poor performance but good consistency
  • Read caching:
    • Store recently read blocks in memory for fast access
Client-Caching: Issues
• Locks/Leases
  • Writes and reads are local, provided you hold a lock/lease
• Two types of locks/leases
  • Write: only one client may hold a write lock
  • Read: multiple clients may hold a read lock; when a write lock is granted, all read locks are revoked
[Figure: the client keeps a local cache of blocks; the server's storage holds the authoritative copies.]
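A sketch of the server-side lock table implied above: many concurrent readers, or a single writer whose grant revokes outstanding read locks (LockTable and the revoke callback are illustrative).

# Sketch: server-side lock table for client caches.
class LockTable:
    def __init__(self):
        self.readers = set()    # clients holding read locks
        self.writer = None      # at most one client holds the write lock

    def acquire_read(self, client):
        if self.writer is not None:
            return False                 # a writer is active: deny
        self.readers.add(client)
        return True

    def acquire_write(self, client):
        if self.writer is not None:
            return False
        for r in self.readers:
            self.revoke(r)               # granting a write revokes all reads
        self.readers.clear()
        self.writer = client
        return True

    def revoke(self, client):
        print(f"revoke cache of {client}")   # stand-in for a callback RPC

locks = LockTable()
locks.acquire_read("A")
locks.acquire_write("B")    # prints: revoke cache of A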
Client-Caching Tradeoffs
Locks Versus Leases
• Locks: client requests, server grants• Client explicitly revokes/gives up lock
• Failure recovery requires tracking locks• Server must track all clients (heartbeats)• On client failure, need complicated
procedure to recover locks (revoke locks)
• Leases: time limit on how long you can hold a resource
• Client Must periodically renew lease• If client does not renew, lock is lost
• Failure recovery is easy• Server doesn’t need to track clients, just leases• On client failure, only need to wait until lease
time out before handing off resource to someone else
Reference: Gray & Cheriton, "Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency" (SOSP 1989).
[Figure: lease protocol: the client calls getLease(K), the server replies lease(K, 60s), and the client must periodically call renewLease(K) or lose access. Lock protocol: the client calls getLock(K), the server replies OK, and the server later explicitly sends revokeLock(K).]
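A sketch of the lease protocol in the figure: grants expire on their own, so the server never needs to track or contact a failed client (getLease/renewLease names follow the figure; the rest is illustrative).

import time

# Sketch: leases expire by themselves; no revocation RPC is needed.
LEASE_SECONDS = 60
leases = {}                            # key -> (client, expiry time)

def get_lease(key, client, now=None):
    now = now or time.time()
    holder = leases.get(key)
    if holder and holder[1] > now and holder[0] != client:
        return None                    # someone else still holds it
    leases[key] = (client, now + LEASE_SECONDS)
    return LEASE_SECONDS

def renew_lease(key, client, now=None):
    return get_lease(key, client, now) # renewal is just a re-grant

# On client failure, the server simply waits out the lease:
t0 = time.time()
get_lease("K", "clientA", now=t0)
assert get_lease("K", "clientB", now=t0 + 30) is None    # still held
assert get_lease("K", "clientB", now=t0 + 61) == 60      # expired: re-granted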
Microsoft’s Opportunistic Lock (not to be confused with optimistic locking)
https://blogs.msdn.microsoft.com/openspecification/2009/05/22/client-caching-features-oplock-vs-lease/
Opportunistic Locking
• "Opportunistic" because the server only grants the locks if/when convenient.
Performance Optimizations
• Caching: client versus server side
  • Client-side: minimizes load on the server and improves read latency
  • Server-side: improves performance
• Server side:
  • Write caching: potential consistency issues
    • Commit: on file close, flush all writes to disk; writes stay in memory until commit -> good performance but consistency issues
    • Write-through: on every write, write to memory -> disk -> report OK; all writes persist before the ack -> poor performance but good consistency
  • Read caching:
    • Store recently read blocks in memory for fast access
• Client side:
  • Locks/leases are used to balance consistency versus performance.
Security and Access Control
• Approaches
  • Capabilities: the client is given a security token that encodes its permissions. The server validates that the token is genuine and uses the permissions inside it to control access to resources.
  • Access lists: the server maintains a list of permissions; on every access, the server consults this list to verify that the client has permission.
• Approaches in a DFS
  • Capabilities: on open, validate the client and hand it a 'capability'; the client attaches the 'capability' to all future requests
  • Access lists: for every request, the client includes identity information
[Figure: capability-based DFS: the client sends RPC(API + capability); the capability itself encodes the access rights. ACL-based DFS: the client sends RPC(API + credentials); the server validates against its ACL list that the credentials grant access to the API.]
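A sketch of both checks, assuming a signed (HMAC) token as the capability so the server can validate it without any lookup, while the ACL path consults a central list on every call; all names here are illustrative.

import hmac, hashlib

SECRET = b"server-only-key"

# Capability path: permissions ride inside a token the server signed at open().
def mint_capability(client, perms):
    msg = f"{client}:{perms}".encode()
    sig = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return msg, sig

def check_capability(token):
    msg, sig = token
    want = hmac.new(SECRET, msg, hashlib.sha256).hexdigest()
    return hmac.compare_digest(sig, want)    # valid until explicitly revoked

# ACL path: the server consults its central list on every request.
acl = {"fileA": {"alice": "rw", "bob": "r"}}

def check_acl(client, file, op):
    return op in acl.get(file, {}).get(client, "")

cap = mint_capability("alice", "rw")
assert check_capability(cap)             # no list lookup needed
assert check_acl("bob", "fileA", "r")    # changes take effect on the next call
assert not check_acl("bob", "fileA", "w")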
Potential Security Trade-offs: Capabilities versus ACL
• Capabilities: hard to revoke/change permissions
  • Permissions are only checked at the beginning (at open)
  • Must send a revocation list and force tokens to be reissued
• ACL: a centralized list, so it is easy to change and adapt
  • Every API call consults the list, so changes are reflected on the next call
[Figure: same two setups as before, but the capability server must also push a revoke list to invalidate already-issued tokens.]
GFS: Google File System

GFS
• Two types of nodes!
  • Master
  • Chunk servers
• Master (handles few API calls)
  • Metadata operations
• Chunk servers (handle most client API calls)
  • Store the actual data
GFS Master
• Single/centralized master
  • Never stores the file contents
  • Only stores metadata/attributes
• Benefits of centralization
  • Easy to write code
  • Can implement sophisticated algorithms
  • Stores all metadata in memory -> performance boost
• Issues with centralization
  • Single point of failure -> keep 2 backups
    • Replicate to the backups before responding to the client
  • Not enough memory -> buy more!!!
[Figure: a client application (e.g., Gmail) issues metadata requests to the GFS master, which is replicated to shadow/backup masters, and exchanges data with the chunk servers.]
GFS Attributes, Data, Metadata
Metadata, kept at the master:
• Dir -> Files
• Files -> Chunks
• Chunks -> servers
Data (i.e., the chunks themselves) lives on the chunk servers.
GFS Attributes, Data, Metadata
• Chunk -> server mappings are not stored persistently at the master
  • A chunk server can die
  • An operator can manually change a chunk server
  • The chunk server is the authoritative voice on what it stores
• Chunk servers include the list of chunks they hold in their heartbeat messages
• The master rebuilds the map of chunk -> server locations from the heartbeat messages (see the sketch below)
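A sketch of the heartbeat-driven rebuild: the master treats chunk locations as soft state and trusts each chunk server's report (class and method names are illustrative).

from collections import defaultdict

# Sketch: the chunk server is authoritative; the master's chunk->server
# map is soft state, rebuilt from heartbeat messages.
class Master:
    def __init__(self):
        self.chunk_locations = defaultdict(set)   # chunk ID -> {servers}

    def on_heartbeat(self, server, chunk_list):
        # Drop stale entries for this server, then trust its fresh report.
        for servers in self.chunk_locations.values():
            servers.discard(server)
        for chunk in chunk_list:
            self.chunk_locations[chunk].add(server)

master = Master()
master.on_heartbeat("cs1", ["c1", "c2"])
master.on_heartbeat("cs2", ["c2"])
master.on_heartbeat("cs1", ["c2"])       # cs1 lost c1 (e.g., a disk died)
assert master.chunk_locations["c1"] == set()
assert master.chunk_locations["c2"] == {"cs1", "cs2"}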
GFS Consistency Semantics
• Types of API calls
  • Metadata operations: create/delete/rename
  • Data operations: reads/writes
• All metadata -> the master: linearizable, because the master gives a global ordering
• Reads/writes -> chunk servers -> potential consistency issues
• Use heartbeats to detect failures
• Maintain three replicas of each chunk
  • On a failed server, create a new replica
• Monitor the load on each server
  • Periodically move replicas/chunks around to balance load
• The single master provides a global total ordering on metadata operations
• The master gives out leases to coordinate writes on data
  • One replica is designated the leader (primary) for the other replicas
[Figure: GFS write path: the client calls Open() at the master and receives the list of chunk servers; the master grants a LeaderLease to one chunk server, making it the leader for that chunk's replicas; the client sends writes to the replicas and the leader orders them; chunk servers send HeartBeats(Chunk List) to the master; shadow masters back up the master.]
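A sketch of the write coordination in the figure: the master hands back the replica list and grants a short lease to one replica, the leader/primary, which then orders the writes (names are illustrative; the real GFS protocol also pipelines data separately from control messages).

import time

# Sketch: the master grants a chunk lease; the leaseholder (primary/leader)
# orders the writes applied by all replicas.
class GFSMaster:
    def __init__(self, locations):
        self.locations = locations   # chunk ID -> list of chunk servers
        self.leases = {}             # chunk ID -> (primary, expiry)

    def open_chunk(self, chunk):
        primary, expiry = self.leases.get(chunk, (None, 0))
        if expiry < time.time():                   # no live lease: grant one
            primary = self.locations[chunk][0]
            self.leases[chunk] = (primary, time.time() + 60)
        return primary, self.locations[chunk]

master = GFSMaster({"c7": ["cs1", "cs2", "cs3"]})
primary, replicas = master.open_chunk("c7")
# The client now sends its write to all replicas; the primary (cs1 here)
# picks the order in which every replica applies it.
assert primary == "cs1" and replicas == ["cs1", "cs2", "cs3"]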
Today
• Distributed file systems
  • Caching: performance versus consistency
  • Locks vs. leases; opportunistic locking
  • Server- vs. client-side caches
• GFS: Google File System
  • Centralized master
  • Consistency semantics