Jeff Darcy / Mark Wagner
Principal Software Engineers, Red Hat
4 May 2011
BUILDING A CLOUD FILESYSTEM
What's It For?
● “Filesystem as a Service”
● Managed by one provider, used by many tenants
[Diagram: the four goals: Familiarity, Scalability, Flexibility, Privacy]
Familiarity: POSIX API, typical filesystem capacity/durability characteristics, (fairly) typical filesystem performance profile.
Flexibility: add/remove servers, add/remove tenants, without disruptive downtime
Privacy: encryption must be done both on the network and on disk, the latter purely client-side (unlike Dropbox and too many others)
Considered adding reference to reliability and recent EBS outage, but (a) too lazy and (b) works like scalability for purposes of this presentation.
What About Existing Filesystems?
● GlusterFS, PVFS2, Ceph, ...
● Not all the same (distributed vs. cluster)
● Even the best don't cover all the bases
Familiarity A-
Scalability B+
Flexibility C+
Privacy F
Familiarity: the minus is because of slight sacrifices in consistency or other areas (e.g. shared mmap, write to unlinked) made by some alternatives.
Scalability: some better than others, primarily due to single metadata server in some alternatives.
Flexibility: Most require downtime during any kind of reconfiguration.
Privacy: “Incomplete” would be a more accurate grade, because most never even had this as a goal. Shout out to OrangeFS for thinking deeply about this.
Privacy Part 1: Separate Namespace
tenantX# ls /mnt/shared_fs/tenantY
a.txt  b.txt  my_secret_file.txt
● Tenant X's files should be completely invisible to any other tenant
● Ditto for space usage
● Solvable with subvolume mounts and directory permissions, but watch out for symlinks etc.
Privacy Part 2: Separate ID Space
● Tenant X's “joe” has the same UID as tenant Y's “fred”
● Two tenants should not have the same UIDs
● ...but server only has one UID space
● must map between per-server and per-tenant spaces
server# ls /shared/tenantX/joe/foo
-rw-r--r-- 1 joe joe 9262 Jan 20 12:00 foo

server# ls /shared/tenantY/fred/bar
-rw-r--r-- 1 joe joe 6481 Mar 09 13:47 bar
“Register your users with our ID service”
Add another step? I create thousands of users every day!
I already run my own ID service, to sync across the company.
Amazon doesn't require that!
It was nice knowing you.
“Own ID service” point is important because this would mean running two ID services simultaneously. Multi-realm Kerberos etc. are *very* painful.
Privacy Part 3: At Rest Encryption
● Where did it come from? Whose data is on it?
● Moral: encrypt and store key separately
Yes, this does happen. Nowadays you can find videos of disks being destroyed when they're decommissioned by large sites, but those are largely a *reaction* to previous incidents and you can't trust every single cloud provider to be so diligent.
Privacy Part 4: Wire Encryption + Authentication
● Know who you're talking to
● Make sure nobody else can listen (or spoof)
Picture credit: http://www.owasp.org/index.php/Man-in-the-middle_attack
CloudFS
● Builds on established technology
● Adds specific functionality for cloud deployment
[Diagram: GlusterFS plus SSL plus CloudFS combine to cover Familiarity, Scalability, Flexibility, and Privacy]
Basic idea is to leverage GlusterFS's features and also acceptance/community/etc. (as important as anything in the code) then add bits to address remaining needs.
GlusterFS Core Concept: Translators
● So named because they translate upper-level I/O requests into lower-level I/O requests using the same interface
● stackable in any order
● can be deployed on either client or server
● Lowest-level “bricks” are just directories on servers
● GlusterFS is an engine to route filesystem requests through translators to bricks
Important to note that the modularity is not just an engineering issue but also packaging, licensing, support, etc. None of the alternatives, even those with good internal architecture, really support this level of “separate add-on” model.
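To make the translator idea concrete, here is a minimal Python sketch (GlusterFS itself is written in C, and these class names are invented for illustration): every layer exposes the same interface and translates a request before handing it to the layer or layers below.

class Brick:
    """Lowest layer: just a directory on a server."""
    def __init__(self, path):
        self.path = path
    def write(self, name, data):
        print("brick %s: write %s (%d bytes)" % (self.path, name, len(data)))

class Replicate:
    """Replicating pattern: send the same request to every child."""
    def __init__(self, children):
        self.children = children
    def write(self, name, data):
        for child in self.children:
            child.write(name, data)

class Distribute:
    """Routing pattern: pick exactly one child, here by hashing the name."""
    def __init__(self, children):
        self.children = children
    def write(self, name, data):
        self.children[hash(name) % len(self.children)].write(name, data)

# Stackable in any order, e.g. distribute over two replica pairs.
volume = Distribute([Replicate([Brick("/export1"), Brick("/export2")]),
                     Replicate([Brick("/export3"), Brick("/export4")])])
volume.write("foo.txt", b"hello")

The same shape works for reads, caching, striping, and so on; only the per-layer logic changes.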
Translator Patterns
[Diagram: translator patterns]
● Caching (read XXXX): bricks do nothing
● Splitting (read XXYY): brick 1 reads XX, brick 2 reads YY
● Replicating (write XXYY): brick 1 and brick 2 both write XXYY
● Routing (write XXYY): brick 1 does nothing, brick 2 writes XXYY

Examples:
● routing pattern: DHT
● replicating pattern: AFR
● splitting pattern: stripe
● null pattern: most performance xlators
● one-to-one: most feature xlators
Translator Types
● Protocols: client, (native) server, NFS server
● Core: distribute (DHT), replicate (AFR), stripe
● Features: locks, access control, quota
● Performance: prefetching, caching, write-behind
● Debugging: trace, latency measurement
● CloudFS: ID mapping, authentication/encryption, future “Dynamo” and async replication
Typical Translator Structure
[Diagram: typical stack; on clients A-D the chain is Mount (FUSE), Cache, Distribute, then Replicate (the client side); replicate pairs reach bricks on Server A /export, Server B /foo, Server C /bar, Server D /x/y/z]
“Client side” is just a convention, albeit a good one. It's not that uncommon e.g. to move replication over to the server side.
Let's Discuss Performance
Test Hardware
● Testing on Westmere EP server-class machines
● Two socket, HT on
● 12 boxes total
● 48 GB fast memory
● 15K drives
● 10 Gbit, 9K jumbo frames enabled
● 4 servers with fully populated internal SAS drives (7)
● 8 boxes used as clients / VM hosts
NFSv4, ext4, etc. The GlusterFS configuration is distribute-only (no replication) so the comparison is against equivalent functionality.
Hardware
[Diagram: servers and clients connected by a 10 Gbit network switch]
First Performance Question We Get Asked
● How does it stack up to NFS?
● One of the first tests we ran before we tuned
● Tests conducted on same box / storage
GlusterFS vs. NFS (Writes)
[Chart: Single-server NFS vs. Gluster write throughput comparison, 64 KB writes; throughput in MBytes/sec (0-1000) vs. number of client threads (1, 2, 4); series: NFS write, Gluster write]
GlusterFS vs. NFS (Reads)
[Chart: Single-server NFS vs. Gluster read throughput comparison, 64 KB reads; throughput in MBytes/sec (0-1000) vs. number of client threads (1, 2, 4); series: NFS, Gluster]
I/O bound
Second Question
● How does it scale?
● Tests run on combinations of servers, clients, VMs
● Representative sample shown here
● Scale up across servers, hosts and threads
Read Scalability - Baremetal
[Chart: Read scalability with a constant 64 threads split over varying numbers of clients (1, 2, 4, 8) and servers (1, 2, 4); aggregate throughput in MBytes/sec, 0-3000]
Pretty much 80% wire speed times number of servers. Notice how one client sees very little benefit from adding servers (as expected).
Write Scalability – Bare metal
[Chart: Write scalability with a constant 64 threads split over varying numbers of clients (1, 2, 4, 8) and servers (1, 2, 4); aggregate throughput in MBytes/sec, 0-3500]
Tuning Fun
● Now that we have the basics, let's play
● Initial tests on RAID0
● Let's try JBOD
Alternatively, compare the effects of letting the RAID controller handle scheduling across disks (on one server) vs. letting GlusterFS do it.
Tuning Tips - Storage Layout
[Chart: Scalability of reads vs. thread count, 14-disk JBOD vs. 2x RAID0; 8 clients, 2 servers, 128 KB size; aggregate throughput in MBytes/sec (0-2000) vs. total threads (0-300)]
Tuning Tips - Storage Layout
[Chart: Scalability of writes vs. thread count, 14-disk JBOD vs. 2x RAID0; 8 clients, 2 servers, 128 KB size; aggregate throughput in MBytes/sec (0-2000) vs. total threads (0-300)]
Key lesson: best config for a workload is very unpredictable in this kind of system. JBOD does better for reads, but it's a wash for writes. For more synchronous or random/small I/O, RAID does significantly better than JBOD. For all of its many advantages, GlusterFS/CloudFS doesn't solve the problem of tuning for a specific workload (and even makes it worse).
Virtualized Performance
● All this bare-metal stuff is interesting, but this is CloudFS, so let's see some virt data
● Use KVM guests running RHEL 6.1
Virtualized Performance - RHEL6.1 KVM Guests
[Chart: Comparison of KVM vs. bare-metal read throughput; 1 client, 1 VM/client, 4 servers, 4 MB stripe; aggregate throughput (0-1000) vs. number of threads (1, 4, 16); series: VM read, bare metal read]
Virtualized Performance - RHEL6.1 KVM Guests
[Chart: Comparison of VM vs. bare-metal write throughput; 1 client, 1 VM/client, 4 servers, 4 MB stripe; aggregate throughput (0-1000) vs. number of threads (1, 4, 16); series: VM write, bare metal write]
Virtualized Performance
● Guest was CPU-bound in previous slides
● Bump the guest from 2 to 4 VCPUs
Tuning Tips – Sizing the Guest
[Chart: Impact of adding VCPUs on sequential write throughput; 1 KVM guest, 1 client, 4 servers, 4 MB 4-way stripe; aggregate throughput in MBytes/sec (0-800) vs. number of threads (1, 4, 16); series: 2-CPU, 4-CPU]
Tuning Tips – Sizing the Guest
[Chart: Impact of adding VCPUs on sequential read throughput; 1 KVM guest, 1 client, 4 servers, 4 MB 4-way stripe; aggregate throughput in MBytes/sec (0-800) vs. number of threads (1, 4, 16); series: 2-CPU, 4-CPU]
CloudFS Implementation
CloudFS Namespace Isolation
● Clients mount subdirectories on each brick
● Subdirectories are combined into per-tenant volumes
          Server 1          Server 2          Server 3
Tenant A  server1:/brick/A  server2:/brick/A  server3:/brick/A
Tenant B  server1:/brick/B  server2:/brick/B  server3:/brick/B
Tenant C  server1:/brick/C  server2:/brick/C  server3:/brick/C
Tenant D  server1:/brick/D  server2:/brick/D  server3:/brick/D
tenantC# mount server1:brick /mnt/xxx
Server (brick) gets vertical slice, tenant gets horizontal slice. Note that different tenants could use different numbers of servers, different replication levels, etc. though this isn't reflected in the slide (or current code). Thanks to Brian ??? (in the audience) who asked about this.
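A tiny Python sketch of the table above, with an invented helper name: a tenant's volume is nothing more than the per-tenant subdirectory on each brick it uses.

def tenant_subvolumes(tenant, servers):
    # One subdirectory per brick; combining these forms the tenant's volume.
    return ["%s:/brick/%s" % (server, tenant) for server in servers]

print(tenant_subvolumes("C", ["server1", "server2", "server3"]))
# ['server1:/brick/C', 'server2:/brick/C', 'server3:/brick/C']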
CloudFS ID Isolation
Tenant           Tenant UID   Server UID
AAA Auto         425          1136
BBB Beauty       227          1137
CCC Computing    92           1138
BBB Beauty       278          1139
...              ...          ...
tenantC# stat -c '%A %u %n' blah
-rw-r--r-- 92 blah

tenantC# stat -c '%A %u %n' /shared/blah
-rw-r--r-- 92 /shared/blah

provider# stat -c '%A %u %n' /bricks/C/blah
-rw-r--r-- 1138 /bricks/C/blah
Somebody pointed out that I should show duplicate tenant UIDs here. Correct. In database terms, tenant+tenantUID form a composite key; neither alone need be unique.
Also, this mapping can be done separately on each server. Servers can use different forward mappings so long as the corresponding reverse yields the same result, so there's no need to share the table etc. This is particularly relevant if tenants aren't all using the same set of servers.
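A rough Python sketch of that mapping, not the actual CloudFS code: (tenant, tenant UID) acts as the composite key, so duplicate tenant-side UIDs are fine, and the reverse map recovers the tenant-side identity on the way back.

forward = {}      # (tenant, tenant_uid) -> server_uid
reverse = {}      # server_uid -> (tenant, tenant_uid)
next_server_uid = 1136

def to_server_uid(tenant, tenant_uid):
    global next_server_uid
    key = (tenant, tenant_uid)
    if key not in forward:
        forward[key] = next_server_uid
        reverse[next_server_uid] = key
        next_server_uid += 1
    return forward[key]

to_server_uid("AAA Auto", 425)        # -> 1136
to_server_uid("BBB Beauty", 227)      # -> 1137
to_server_uid("CCC Computing", 92)    # -> 1138
print(reverse[to_server_uid("BBB Beauty", 278)])   # ('BBB Beauty', 278), server UID 1139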
CloudFS Authentication
● OpenSSL with provider-signed certificates
● Identity used by other CloudFS functions
[Diagram: one time, the tenant sends a client certificate request (ID=x) to the provider and receives a provider-signed certificate; every time after that, a client owned by the tenant makes an SSL connection using that certificate, which provides both authentication and encryption]
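For flavor, a hedged sketch of the client side using Python's standard ssl module rather than the real CloudFS/GlusterFS transport; the host name, port, and file names are placeholders.

import socket, ssl

ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_CLIENT)
ctx.load_verify_locations("provider_ca.pem")               # trust the provider's CA
ctx.load_cert_chain("tenant_cert.pem", "tenant_key.pem")   # provider-signed tenant identity

with socket.create_connection(("server1.example.com", 24007)) as sock:
    with ctx.wrap_socket(sock, server_hostname="server1.example.com") as tls:
        # Verified server certificate: we know who we're talking to,
        # and the connection is encrypted against listeners and spoofers.
        print(tls.getpeercert()["subject"])

The identity established this way is what the other CloudFS functions key off of, per the slide above.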
CloudFS Encryption
● Purely client side, not even escrow on server
● Provides privacy and indemnity
● Problem: partial-block writes
[Diagram: a partial cipher block being written; the remainder of the block is fetched from the server, because all input bytes affect all output bytes]
All good encryption algorithms have the property that all input bytes within a cipher block affect all output bytes - in unpredictable ways, because unpredictability is the whole point of encryption, but still. Therefore, we need all input bytes and the user might not have provided them in the write.
Key problem here is concurrent read-modify-write cycles from the same or different clients. CloudFS solves this with server-side queuing/leasing to serialize partial-block writes (whole-block writes just flow straight through) but that's not shown.
Adding authentication codes to protect against corruption or tampering has same problems and (mostly, fortunately) same solutions.
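A toy Python sketch of the read-modify-write cycle for a partial-block write; the "cipher" here is a placeholder XOR, and the server-side queuing/leasing described above is omitted.

BLOCK = 16   # cipher block size in bytes

def encrypt_block(plaintext):
    # Stand-in for a real cipher, where every input byte affects every output byte.
    return bytes(b ^ 0x5A for b in plaintext)

def write_partial(read_plain_block, offset, data):
    """Re-encrypt the whole cipher block that contains a partial write."""
    block_no = offset // BLOCK
    start = offset % BLOCK
    old = read_plain_block(block_no)                      # remainder fetched from server
    new = old[:start] + data + old[start + len(data):]    # splice in the partial write
    return block_no, encrypt_block(new)                   # whole block goes back out

old_blocks = {0: b"AAAABBBBCCCCDDDD"}                     # decrypted copy of block 0
print(write_partial(old_blocks.get, 4, b"xx"))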
Gluster to CloudFS
● So far we have been talking about Gluster performance
● Now let's look at the overhead of the CloudFS-specific components
CloudFS Encryption Overhead
[Chart: CloudFS encryption overhead; aggregate throughput in MBytes/sec (0-2500) vs. number of clients (1-6); series: NFSv4, GlusterFS, Encryption]
Scaling is almost linear with number of clients. That's because it's purely a client-side bottleneck (insufficient parallelism for the encryption phase) which should be fixable. There will always be a cost associated with encryption, but it should be a performance/resource tradeoff and better than shown here.
CloudFS Multi-Tenancy Overhead
[Chart: CloudFS multi-tenancy overhead; files/second (0-3500) vs. number of clients (1-6); series: NFSv4, GlusterFS, Multi-tenant]
Very little cost, and even (surprisingly) some benefit. Current theory is that when tenants are using separate subdirectories within each brick, this reduces some contention. Further investigation needed.
For More Information
● CloudFS blog: http://cloudfs.org
● Mailing lists:
● https://fedorahosted.org/mailman/listinfo/cloudfs-general
● https://fedorahosted.org/mailman/listinfo/cloudfs-devel
● Code: http://git.fedorahosted.org/git/?p=CloudFS.git
● More to come (wikis, bug tracker, etc.)
Backup: CloudFS “Dynamo” Translator (future)
● Greater scalability
● Faster replication
● Faster replica repair
● Faster rebalancing
● Variable # of replicas
[Diagram: “Dynamo” consistent hashing ring with servers S1, S2, S3 and files A, B, C, D placed around it]
Greater scalability because directories no longer need to exist globally (not evident from picture). Faster replication because extra operations are only necessary in case of failure, not in normal operation.
The most important gain is really flexibility, since this style of consistent hashing allows adding single servers instead of replica-count at once (required with current DHT), different-size bricks are smoothly handled via virtual node IDs, different tenants can use different replica levels, etc.
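A rough Python sketch of that style of consistent hashing, with invented names: each server takes several virtual node IDs on a ring, a file lands on the next N distinct servers clockwise from its hash, N can differ per tenant, and adding a single server only moves the keys adjacent to its virtual nodes.

import bisect, hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, servers, vnodes=64):
        # Each server contributes `vnodes` points; bigger bricks could get more.
        self.points = sorted((h("%s#%d" % (s, i)), s)
                             for s in servers for i in range(vnodes))
        self.hashes = [p for p, _ in self.points]

    def locate(self, name, replicas=2):
        # Walk clockwise from the file's hash until we have `replicas` distinct servers.
        idx = bisect.bisect(self.hashes, h(name))
        chosen = []
        while len(chosen) < replicas:
            server = self.points[idx % len(self.points)][1]
            if server not in chosen:
                chosen.append(server)
            idx += 1
        return chosen

ring = Ring(["S1", "S2", "S3"])
print(ring.locate("A"), ring.locate("D", replicas=3))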
Backup: CloudFS Async Replication (future)
● Multiple masters
● Partition tolerant
● writes accepted everywhere
● Eventually consistent
● version vectors etc.
● Preserves client-side encryption security
● Unrelated to Gluster geosync
[Diagram: Site A with servers S1, S2, S3; Site B with S4, S5; Site C with S6, S7; replication among the sites]
Gluster's georeplication is really rather limited to disaster recovery - single source, single destination, non-continuous, unordered, etc. Their Merkle-tree-like “marker” functionality is an improvement over checksum-based rsync, but not by much and propagating “dirty” markers all the way up to the volume root doesn't exactly come for free.
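A minimal Python sketch of the version vectors behind "eventually consistent": each site bumps its own counter when it accepts a write, and comparing two vectors tells you whether one write supersedes the other or they are concurrent and need conflict resolution.

def bump(vv, site):
    # Record a write accepted at `site`.
    vv = dict(vv)
    vv[site] = vv.get(site, 0) + 1
    return vv

def compare(a, b):
    sites = set(a) | set(b)
    a_ge = all(a.get(s, 0) >= b.get(s, 0) for s in sites)
    b_ge = all(b.get(s, 0) >= a.get(s, 0) for s in sites)
    if a_ge and b_ge: return "equal"
    if a_ge: return "a supersedes b"
    if b_ge: return "b supersedes a"
    return "concurrent (conflict)"

v1 = bump({}, "siteA")       # write accepted at site A
v2 = bump(v1, "siteB")       # site B writes after seeing A's version
v3 = bump(v1, "siteC")       # site C writes without seeing B's
print(compare(v1, v2))       # b supersedes a
print(compare(v2, v3))       # concurrent (conflict)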