Jeff Darcy / Mark Wagner
Principal Software Engineers, Red Hat
4 May 2011
BUILDING A CLOUD FILESYSTEM
What's It For?
● “Filesystem as a Service”
● Managed by one provider, used by many tenants
[Diagram: the four goals: Familiarity, Scalability, Flexibility, Privacy]
Familiarity: POSIX API, typical filesystem capacity/durability characteristics, (fairly) typical filesystem performance profile.
Flexibility: add/remove servers, add/remove tenants, without disruptive downtime
Privacy: encryption must be done both on the network and on disk, the latter purely client-side (unlike Dropbox and too many others)
Considered adding reference to reliability and recent EBS outage, but (a) too lazy and (b) works like scalability for purposes of this presentation.
What About Existing Filesystems?
● GlusterFS, PVFS2, Ceph, ...
● Not all the same (distributed vs. cluster)
● Even the best don't cover all the bases
Familiarity A-
Scalability B+
Flexibility C+
Privacy F
Familiarity: the minus is because of slight sacrifices in consistency or other areas (e.g. shared mmap, write to unlinked) made by some alternatives.
Scalability: some better than others, primarily due to single metadata server in some alternatives.
Flexibility: Most require downtime during any kind of reconfiguration.
Privacy: “Incomplete” would be a more accurate grade, because most never even had this as a goal. Shout out to OrangeFS for thinking deeply about this.
Privacy Part 1: Separate Namespace
tenantX# ls /mnt/shared_fs/tenantY
a.txt  b.txt  my_secret_file.txt
● Tenant X's files should be completely invisible to any other tenant
● Ditto for space usage
● Solvable with subvolume mounts and directory permissions, but watch out for symlinks etc.
Privacy Part 2: Separate ID Space
● Tenant X's “joe” has the same UID as tenant Y's “fred”
● Two tenants should not have the same UIDs
● ...but server only has one UID space
● must map between per-server and per-tenant spaces
server# ls /shared/tenantX/joe/foo
-rw-r--r-- 1 joe joe 9262 Jan 20 12:00 foo

server# ls /shared/tenantY/fred/bar
-rw-r--r-- 1 joe joe 6481 Mar 09 13:47 bar
“Register your users with our ID service”
Add another step? I create thousands of users every day!
I already run my own ID service, to sync across the company.
Amazon doesn't require that!
It was nice knowing you.
“Own ID service” point is important because this would mean running two ID services simultaneously. Multi-realm Kerberos etc. are *very* painful.
Privacy Part 3: At Rest Encryption
● Where did it come from? Whose data is on it?
● Moral: encrypt and store key separately
Yes, this does happen. Nowadays you can find videos of disks being destroyed when they're decommissioned by large sites, but those are largely a *reaction* to previous incidents and you can't trust every single cloud provider to be so diligent.
Privacy Part 4: Wire Encryption + Authentication
● Know who you're talking to
● Make sure nobody else can listen (or spoof)
Picture credit: http://www.owasp.org/index.php/Man-in-the-middle_attack
CloudFS
● Builds on established technology
● Adds specific functionality for cloud deployment
[Diagram: GlusterFS plus SSL plus CloudFS combine to cover Familiarity, Scalability, Flexibility, and Privacy]
Basic idea is to leverage GlusterFS's features and also acceptance/community/etc. (as important as anything in the code) then add bits to address remaining needs.
GlusterFS Core Concept: Translators
● So named because they translate upper-level I/O requests into lower-level I/O requests using the same interface
● stackable in any order
● can be deployed on either client or server
● Lowest-level “bricks” are just directories on servers
● GlusterFS is an engine to route filesystem requests through translators to bricks
Important to note that the modularity is not just an engineering issue but also packaging, licensing, support, etc. None of the alternatives, even those with good internal architecture, really support this level of “separate add-on” model.
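To make the translator idea concrete, here is a minimal Python sketch (GlusterFS itself is written in C, and these class names are invented for illustration): every layer exposes the same interface and translates a request before handing it to the layer or layers below.

class Brick:
    """Lowest layer: just a directory on a server."""
    def __init__(self, path):
        self.path = path
    def write(self, name, data):
        print("brick %s: write %s (%d bytes)" % (self.path, name, len(data)))

class Replicate:
    """Replicating pattern: send the same request to every child."""
    def __init__(self, children):
        self.children = children
    def write(self, name, data):
        for child in self.children:
            child.write(name, data)

class Distribute:
    """Routing pattern: pick exactly one child, here by hashing the name."""
    def __init__(self, children):
        self.children = children
    def write(self, name, data):
        self.children[hash(name) % len(self.children)].write(name, data)

# Stackable in any order, e.g. distribute over two replica pairs.
volume = Distribute([Replicate([Brick("/export1"), Brick("/export2")]),
                     Replicate([Brick("/export3"), Brick("/export4")])])
volume.write("foo.txt", b"hello")

The same shape works for reads, caching, striping, and so on; only the per-layer logic changes.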
Translator Patterns
[Diagram: translator patterns]
● Caching (read XXXX): bricks do nothing
● Splitting (read XXYY): brick 1 reads XX, brick 2 reads YY
● Replicating (write XXYY): brick 1 and brick 2 both write XXYY
● Routing (write XXYY): brick 1 does nothing, brick 2 writes XXYY

Examples:
● routing pattern: DHT
● replicating pattern: AFR
● splitting pattern: stripe
● null pattern: most performance xlators
● one-to-one: most feature xlators
Translator Types
● Protocols: client, (native) server, NFS server
● Core: distribute (DHT), replicate (AFR), stripe
● Features: locks, access control, quota
● Performance: prefetching, caching, write-behind
● Debugging: trace, latency measurement
● CloudFS: ID mapping, authentication/encryption, future “Dynamo” and async replication
Typical Translator Structure
[Diagram: typical stack; on clients A-D the chain is Mount (FUSE), Cache, Distribute, then Replicate (the client side); replicate pairs reach bricks on Server A /export, Server B /foo, Server C /bar, Server D /x/y/z]
“Client side” is just a convention, albeit a good one. It's not that uncommon e.g. to move replication over to the server side.
Let's Discuss Performance
Test Hardware
● Testing on Westmere EP server-class machines
● Two socket, HT on
● 12 boxes total
● 48 GB fast memory
● 15K drives
● 10 Gbit, 9K jumbo frames enabled
● 4 servers with fully populated internal SAS drives (7)
● 8 boxes used as clients / VM hosts
NFSv4, ext4, etc. The GlusterFS configuration is distribute-only (no replication) so the comparison is against equivalent functionality.
Hardware
[Diagram: servers and clients connected by a 10 Gbit network switch]
First Performance Question We Get Asked
● How does it stack up to NFS?
● One of the first tests we ran before we tuned
● Tests conducted on same box / storage
GlusterFS vs. NFS (Writes)
[Chart: Single-server NFS vs. Gluster write throughput comparison, 64 KB writes; throughput in MBytes/sec (0-1000) vs. number of client threads (1, 2, 4); series: NFS write, Gluster write]
GlusterFS vs. NFS (Reads)
[Chart: Single-server NFS vs. Gluster read throughput comparison, 64 KB reads; throughput in MBytes/sec (0-1000) vs. number of client threads (1, 2, 4); series: NFS, Gluster]
I/O bound
Second Question
● How does it scale?
● Tests run on combinations of servers, clients, VMs
● Representative sample shown here
● Scale up across servers, hosts and threads
Read Scalability - Baremetal
[Chart: Read scalability with a constant 64 threads split over varying numbers of clients (1, 2, 4, 8) and servers (1, 2, 4); aggregate throughput in MBytes/sec, 0-3000]
Pretty much 80% wire speed times number of servers. Notice how one client sees very little benefit from adding servers (as expected).
Write Scalability – Bare metal
[Chart: Write scalability with a constant 64 threads split over varying numbers of clients (1, 2, 4, 8) and servers (1, 2, 4); aggregate throughput in MBytes/sec, 0-3500]
Tuning Fun
● Now that we have the basics, let's play
● Initial tests on RAID0
● Let's try JBOD
Alternatively, compare the effects of letting the RAID controller handle scheduling across disks (on one server) vs. letting GlusterFS do it.
Tuning Tips - Storage Layout
[Chart: Scalability of reads vs. thread count, 14-disk JBOD vs. 2x RAID0; 8 clients, 2 servers, 128 KB size; aggregate throughput in MBytes/sec (0-2000) vs. total threads (0-300)]
Tuning Tips - Storage Layout
[Chart: Scalability of writes vs. thread count, 14-disk JBOD vs. 2x RAID0; 8 clients, 2 servers, 128 KB size; aggregate throughput in MBytes/sec (0-2000) vs. total threads (0-300)]
Key lesson: best config for a workload is very unpredictable in this kind of system. JBOD does better for reads, but it's a wash for writes. For more synchronous or random/small I/O, RAID does significantly better than JBOD. For all of its many advantages, GlusterFS/CloudFS doesn't solve the problem of tuning for a specific workload (and even makes it worse).
Virtualized Performance
● All this bare-metal stuff is interesting, but this is CloudFS, so let's see some virt data
● Use KVM guests running RHEL 6.1
Virtualized Performance - RHEL6.1 KVM Guests
[Chart: Comparison of KVM vs. bare-metal read throughput; 1 client, 1 VM/client, 4 servers, 4 MB stripe; aggregate throughput (0-1000) vs. number of threads (1, 4, 16); series: VM read, bare metal read]
Virtualized Performance - RHEL6.1 KVM Guests
[Chart: Comparison of VM vs. bare-metal write throughput; 1 client, 1 VM/client, 4 servers, 4 MB stripe; aggregate throughput (0-1000) vs. number of threads (1, 4, 16); series: VM write, bare metal write]
Virtualized Performance
● Guest was CPU-bound in previous slides
● Bump the guest from 2 to 4 VCPUs
Tuning Tips – Sizing the Guest
[Chart: Impact of adding VCPUs on sequential write throughput; 1 KVM guest, 1 client, 4 servers, 4 MB 4-way stripe; aggregate throughput in MBytes/sec (0-800) vs. number of threads (1, 4, 16); series: 2-CPU, 4-CPU]
Tuning Tips – Sizing the Guest
[Chart: Impact of adding VCPUs on sequential read throughput; 1 KVM guest, 1 client, 4 servers, 4 MB 4-way stripe; aggregate throughput in MBytes/sec (0-800) vs. number of threads (1, 4, 16); series: 2-CPU, 4-CPU]
CloudFS Implementation
CloudFS Namespace Isolation
● Clients mount subdirectories on each brick
● Subdirectories are combined into per-tenant volumes
          Server 1          Server 2          Server 3
Tenant A  server1:/brick/A  server2:/brick/A  server3:/brick/A
Tenant B  server1:/brick/B  server2:/brick/B  server3:/brick/B
Tenant C  server1:/brick/C  server2:/brick/C  server3:/brick/C
Tenant D  server1:/brick/D  server2:/brick/D  server3:/brick/D
tenantC# mount server1:brick /mnt/xxx
Server (brick) gets vertical slice, tenant gets horizontal slice. Note that different tenants could use different numbers of servers, different replication levels, etc. though this isn't reflected in the slide (or current code). Thanks to Brian ??? (in the audience) who asked about this.
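A tiny Python sketch of the table above, with an invented helper name: a tenant's volume is nothing more than the per-tenant subdirectory on each brick it uses.

def tenant_subvolumes(tenant, servers):
    # One subdirectory per brick; combining these forms the tenant's volume.
    return ["%s:/brick/%s" % (server, tenant) for server in servers]

print(tenant_subvolumes("C", ["server1", "server2", "server3"]))
# ['server1:/brick/C', 'server2:/brick/C', 'server3:/brick/C']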
CloudFS ID Isolation
Tenant           Tenant UID   Server UID
AAA Auto         425          1136
BBB Beauty       227          1137
CCC Computing    92           1138
BBB Beauty       278          1139
...              ...          ...
tenantC# stat -c '%A %u %n' blah
-rw-r--r-- 92 blah

tenantC# stat -c '%A %u %n' /shared/blah
-rw-r--r-- 92 /shared/blah

provider# stat -c '%A %u %n' /bricks/C/blah
-rw-r--r-- 1138 /bricks/C/blah
Somebody pointed out that I should show duplicate tenant UIDs here. Correct. In database terms, tenant+tenantUID form a composite key; neither alone need be unique.
Also, this mapping can be done separately on each server. Servers can use different forward mappings so long as the corresponding reverse yields the same result, so there's no need to share the table etc. This is particularly relevant if tenants aren't all using the same set of servers.
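A rough Python sketch of that mapping, not the actual CloudFS code: (tenant, tenant UID) acts as the composite key, so duplicate tenant-side UIDs are fine, and the reverse map recovers the tenant-side identity on the way back.

forward = {}      # (tenant, tenant_uid) -> server_uid
reverse = {}      # server_uid -> (tenant, tenant_uid)
next_server_uid = 1136

def to_server_uid(tenant, tenant_uid):
    global next_server_uid
    key = (tenant, tenant_uid)
    if key not in forward:
        forward[key] = next_server_uid
        reverse[next_server_uid] = key
        next_server_uid += 1
    return forward[key]

to_server_uid("AAA Auto", 425)        # -> 1136
to_server_uid("BBB Beauty", 227)      # -> 1137
to_server_uid("CCC Computing", 92)    # -> 1138
print(reverse[to_server_uid("BBB Beauty", 278)])   # ('BBB Beauty', 278), server UID 1139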
CloudFS Authentication
● OpenSSL with provider-signed certificates
● Identity used by other CloudFS functions
[Diagram: one time, the tenant sends a client certificate request (ID=x) to the provider and receives a provider-signed certificate; every time after that, a client owned by the tenant makes an SSL connection using that certificate, which provides both authentication and encryption]
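For flavor, a hedged sketch of the client side using Python's standard ssl module rather than the real CloudFS/GlusterFS transport; the host name, port, and file names are placeholders.

import socket, ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.load_verify_locations("provider_ca.pem")               # trust the provider's CA
ctx.load_cert_chain("tenant_cert.pem", "tenant_key.pem")   # provider-signed tenant identity

with socket.create_connection(("server1.example.com", 24007)) as sock:
    with ctx.wrap_socket(sock, server_hostname="server1.example.com") as tls:
        # Verified server certificate: we know who we're talking to,
        # and the connection is encrypted against listeners and spoofers.
        print(tls.getpeercert()["subject"])

The identity established this way is what the other CloudFS functions key off of, per the slide above.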
CloudFS Encryption
● Purely client side, not even escrow on server
● Provides privacy and indemnity
● Problem: partial-block writes
[Diagram: a partial cipher block being written; the remainder of the block is fetched from the server, because all input bytes affect all output bytes]
All good encryption algorithms have the property that all input bytes within a cipher block affect all output bytes - in unpredictable ways, because unpredictability is the whole point of encryption, but still. Therefore, we need all input bytes and the user might not have provided them in the write.
Key problem here is concurrent read-modify-write cycles from the same or different clients. CloudFS solves this with server-side queuing/leasing to serialize partial-block writes (whole-block writes just flow straight through) but that's not shown.
Adding authentication codes to protect against corruption or tampering has same problems and (mostly, fortunately) same solutions.
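A toy Python sketch of the read-modify-write cycle for a partial-block write; the "cipher" here is a placeholder XOR, and the server-side queuing/leasing described above is omitted.

BLOCK = 16   # cipher block size in bytes

def encrypt_block(plaintext):
    # Stand-in for a real cipher, where every input byte affects every output byte.
    return bytes(b ^ 0x5A for b in plaintext)

def write_partial(read_plain_block, offset, data):
    """Re-encrypt the whole cipher block that contains a partial write."""
    block_no = offset // BLOCK
    start = offset % BLOCK
    old = read_plain_block(block_no)                      # remainder fetched from server
    new = old[:start] + data + old[start + len(data):]    # splice in the partial write
    return block_no, encrypt_block(new)                   # whole block goes back out

old_blocks = {0: b"AAAABBBBCCCCDDDD"}                     # decrypted copy of block 0
print(write_partial(old_blocks.get, 4, b"xx"))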
Gluster to CloudFS
● So far we have been talking about Gluster performance
● Now let's look at the overhead of the CloudFS-specific components
CloudFS Encryption Overhead
[Chart: CloudFS encryption overhead; aggregate throughput in MBytes/sec (0-2500) vs. number of clients (1-6); series: NFSv4, GlusterFS, Encryption]
Scaling is almost linear with number of clients. That's because it's purely a client-side bottleneck (insufficient parallelism for the encryption phase) which should be fixable. There will always be a cost associated with encryption, but it should be a performance/resource tradeoff and better than shown here.
CloudFS Multi-Tenancy Overhead
[Chart: CloudFS multi-tenancy overhead; files/second (0-3500) vs. number of clients (1-6); series: NFSv4, GlusterFS, Multi-tenant]
Very little cost, and even (surprisingly) some benefit. Current theory is that when tenants are using separate subdirectories within each brick, this reduces some contention. Further investigation needed.
For More Information
● CloudFS blog: http://cloudfs.org
● Mailing lists:
● https://fedorahosted.org/mailman/listinfo/cloudfs-general
● https://fedorahosted.org/mailman/listinfo/cloudfs-devel
● Code: http://git.fedorahosted.org/git/?p=CloudFS.git
● More to come (wikis, bug tracker, etc.)
Backup: CloudFS “Dynamo” Translator (future)
● Greater scalability
● Faster replication
● Faster replica repair
● Faster rebalancing
● Variable # of replicas
[Diagram: “Dynamo” consistent hashing ring with servers S1, S2, S3 and files A, B, C, D placed around it]
Greater scalability because directories no longer need to exist globally (not evident from picture). Faster replication because extra operations are only necessary in case of failure, not in normal operation.
The most important gain is really flexibility, since this style of consistent hashing allows adding single servers instead of replica-count at once (required with current DHT), different-size bricks are smoothly handled via virtual node IDs, different tenants can use different replica levels, etc.
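A rough Python sketch of that style of consistent hashing, with invented names: each server takes several virtual node IDs on a ring, a file lands on the next N distinct servers clockwise from its hash, N can differ per tenant, and adding a single server only moves the keys adjacent to its virtual nodes.

import bisect, hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers, vnodes=64):
        # Each server contributes `vnodes` points; bigger bricks could get more.
        self.points = sorted((h("%s#%d" % (s, i)), s)
                             for s in servers for i in range(vnodes))
        self.hashes = [p for p, _ in self.points]

    def locate(self, name, replicas=2):
        # Walk clockwise from the file's hash until we have `replicas` distinct servers.
        idx = bisect.bisect(self.hashes, h(name))
        chosen = []
        while len(chosen) < replicas:
            server = self.points[idx % len(self.points)][1]
            if server not in chosen:
                chosen.append(server)
            idx += 1
        return chosen

ring = Ring(["S1", "S2", "S3"])
print(ring.locate("A"), ring.locate("D", replicas=3))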
Backup: CloudFS Async Replication (future)
● Multiple masters
● Partition tolerant
● writes accepted everywhere
● Eventually consistent
● version vectors etc.
● Preserves client-side encryption security
● Unrelated to Gluster geosync
[Diagram: Site A with servers S1, S2, S3; Site B with S4, S5; Site C with S6, S7; replication among the sites]
Gluster's georeplication is really rather limited to disaster recovery - single source, single destination, non-continuous, unordered, etc. Their Merkle-tree-like “marker” functionality is an improvement over checksum-based rsync, but not by much and propagating “dirty” markers all the way up to the volume root doesn't exactly come for free.
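A minimal Python sketch of the version vectors behind "eventually consistent": each site bumps its own counter when it accepts a write, and comparing two vectors tells you whether one write supersedes the other or they are concurrent and need conflict resolution.

def bump(vv, site):
    # Record a write accepted at `site`.
    vv = dict(vv)
    vv[site] = vv.get(site, 0) + 1
    return vv

def compare(a, b):
    sites = set(a) | set(b)
    a_ge = all(a.get(s, 0) >= b.get(s, 0) for s in sites)
    b_ge = all(b.get(s, 0) >= a.get(s, 0) for s in sites)
    if a_ge and b_ge: return "equal"
    if a_ge: return "a supersedes b"
    if b_ge: return "b supersedes a"
    return "concurrent (conflict)"

v1 = bump({}, "siteA")       # write accepted at site A
v2 = bump(v1, "siteB")       # site B writes after seeing A's version
v3 = bump(v1, "siteC")       # site C writes without seeing B's
print(compare(v1, v2))       # b supersedes a
print(compare(v2, v3))       # concurrent (conflict)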