Hands On Gluster with Jeff Darcy

Gluster Tutorial. Jeff Darcy, Red Hat. LISA 2016 (Boston)


TRANSCRIPT

Page 1: Hands On Gluster with Jeff Darcy

Gluster Tutorial
Jeff Darcy, Red Hat
LISA 2016 (Boston)

Page 2: Hands On Gluster with Jeff Darcy

Agenda

▸ Alternating info-dump and hands-on
▹ This is part of the info-dump ;)

▸ Gluster basics
▸ Initial setup
▸ Extra features
▸ Maintenance and troubleshooting

Page 3: Hands On Gluster with Jeff Darcy

Who Am I?

▸ One of three project-wide architects
▸ First Red Hat employee to be seriously involved with Gluster (before acquisition)

▸ Previously worked on NFS (v2..v4), Lustre, PVFS2, others

▸ General distributed-storage blatherer
▹ http://pl.atyp.us / @Obdurodon

Page 4: Hands On Gluster with Jeff Darcy

TEMPLATE CREDITS

Special thanks to all the people who made and released these awesome resources for free:

▸ Presentation template by SlidesCarnival
▸ Photographs by Death to the Stock Photo (license)

Page 5: Hands On Gluster with Jeff Darcy

Some Terminology

▸ A brick is simply a directory on a server
▸ We use translators to combine bricks into more complex subvolumes
▹ For scale, replication, sharding, ...
▸ This forms a translator graph, contained in a volfile (see the sketch below)
▸ Internal daemons (e.g. self heal) use the same bricks arranged into slightly different volfiles
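To make the translator graph concrete, here is a minimal hand-written sketch of what a client-side volfile for a two-brick replicated volume looks like. The names (fubar-client-0, serverA, /brick1) are illustrative; a real volfile generated by the management daemon stacks many more translators and options on top:

volume fubar-client-0
    type protocol/client
    option remote-host serverA
    option remote-subvolume /brick1
end-volume

volume fubar-client-1
    type protocol/client
    option remote-host serverB
    option remote-subvolume /brick2
end-volume

volume fubar-replicate-0
    type cluster/replicate
    subvolumes fubar-client-0 fubar-client-1
end-volume

Here cluster/replicate (AFR) is a fan-out translator: it treats its two subvolumes as mirrors and presents them upward as a single subvolume.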

Page 6: Hands On Gluster with Jeff Darcy

Hands On: Getting Started

1. Use the RHGS test drive
▹ http://bit.ly/glustertestdrive
2. Start a Fedora/CentOS VM
▹ Use yum/dnf to install gluster
▹ base, libs, server, fuse, client-xlators, cli
3. Docker Docker Docker
▹ https://github.com/gluster/gluster-containers

Page 7: Hands On Gluster with Jeff Darcy

Brick / Translator Example

[Diagram: four bricks, one per server: Server A /brick1, Server B /brick2, Server C /brick3, Server D /brick4]

Page 8: Hands On Gluster with Jeff Darcy

Brick / Translator Example

[Diagram: the same four bricks, paired by replication: Server A /brick1 and Server B /brick2 form Replica Set 1 (a subvolume); Server C /brick3 and Server D /brick4 form Replica Set 2 (also a subvolume)]

Page 9: Hands On Gluster with Jeff Darcy

Brick / Translator Example

[Diagram: the two replica sets are combined in turn into the volume “fubar”]

Page 10: Hands On Gluster with Jeff Darcy

Translator Patterns

[Diagram: two translator patterns. Fan-out or “cluster” translators (e.g. AFR, EC, DHT) sit above multiple subvolumes, like AFR above Server A /brick1 and Server B /brick2 forming Replica Set 1. Pass-through translators (e.g. performance translators such as md-cache) sit above a single subvolume.]

Page 11: Hands On Gluster with Jeff Darcy

Access Methods

[Diagram: ways into a Gluster volume. Client access methods: FUSE, Samba, Ganesha, TCMU, GFAPI. Internal daemons, which are also clients of the volume: Self heal, Rebalance, Quota, Snapshot, Bitrot.]

Page 12: Hands On Gluster with Jeff Darcy

GlusterD

▸ Management daemon
▸ Maintains membership, detects server failures
▸ Stages configuration changes
▸ Starts and monitors other daemons

Page 13: Hands On Gluster with Jeff Darcy

Simple Configuration Example

serverA# gluster peer probe serverB

serverA# gluster volume create fubar \
             replica 2 \
             serverA:/brick1 serverB:/brick2

serverA# gluster volume start fubar

clientX# mount -t glusterfs serverA:fubar /mnt/gluster_fubar

Page 14: Hands On Gluster with Jeff Darcy

Hands On: Connect Servers

[root@vagrant-testVM glusterfs]# gluster peer probe 192.168.121.66
peer probe: success.
[root@vagrant-testVM glusterfs]# gluster peer status
Number of Peers: 1

Hostname: 192.168.121.66
Uuid: 95aee0b5-c816-445b-8dbc-f88da7e95660
State: Accepted peer request (Connected)

Page 15: Hands On Gluster with Jeff Darcy

Hands On: Server Volume Setup

[root@vagrant-testVM glusterfs]# gluster volume create fubar \
    replica 2 testvm:/d/backends/fubar{0,1} force
volume create: fubar: success: please start the volume to access data
[root@vagrant-testVM glusterfs]# gluster volume info fubar
... (see for yourself)
[root@vagrant-testVM glusterfs]# gluster volume status fubar
Volume fubar is not started

Page 16: Hands On Gluster with Jeff Darcy

Hands On: Server Volume Setup

[root@vagrant-testVM glusterfs]# gluster volume start fubar
volume start: fubar: success
[root@vagrant-testVM glusterfs]# gluster volume status fubar
Status of volume: fubar
Gluster process                          TCP Port  RDMA Port  Online  Pid
------------------------------------------------------------------------------
Brick testvm:/d/backends/fubar0          49152     0          Y       13104
Brick testvm:/d/backends/fubar1          49153     0          Y       13133
Self-heal Daemon on localhost            N/A       N/A        Y       13163

Task Status of Volume fubar
------------------------------------------------------------------------------
There are no active volume tasks

Page 17: Hands On Gluster with Jeff Darcy

Hands On: Client Volume Setup

[root@vagrant-testVM glusterfs]# mount -t glusterfs testvm:fubar \
    /mnt/glusterfs/0
[root@vagrant-testVM glusterfs]# df /mnt/glusterfs/0
Filesystem    1K-blocks   Used  Available  Use%  Mounted on
testvm:fubar    5232640  33280    5199360    1%  /mnt/glusterfs/0
[root@vagrant-testVM glusterfs]# ls -a /mnt/glusterfs/0
.  ..
[root@vagrant-testVM glusterfs]# ls -a /d/backends/fubar0
.  ..  .glusterfs

Page 18: Hands On Gluster with Jeff Darcy

Hands On: It’s a Filesystem!

▸ Create some files
▸ Create directories, symlinks, ...
▸ Rename, delete, ...
▸ Test performance
▹ OK, not yet

Page 19: Hands On Gluster with Jeff Darcy

Distribution and Rebalancing

[Diagram: the 32-bit hash space from 0 to 0xffffffff, split at 0x7fffffff into Server X’s range and Server Y’s range, with files shown as dots landing in one range or the other]

● Each brick “claims” a range of hash values
○ Collection of claims is called a layout
● Files (dots) are hashed, placed on brick claiming that range
● When bricks are added, claims are adjusted to minimize data motion

Page 20: Hands On Gluster with Jeff Darcy

Distribution and Rebalancing

[Diagram: before adding a brick, Server X claims 0 through 0x7fffffff and Server Y claims 0x80000000 through 0xffffffff. After Server Z is added, the layout is split into thirds at 0x55555555 and 0xaaaaaaaa, with Z taking the middle range; only the slices moved X->Z and Y->Z require data motion. A minimal sketch of this placement logic follows.]
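The following Python sketch is illustrative only. Real DHT hashes file names with a Davies-Meyer hash and stores layouts in the trusted.glusterfs.dht xattr (shown in the next hands-on), but the range arithmetic is the same in spirit:

import hashlib

MAX_HASH = 0xffffffff

def make_layout(bricks):
    """Give each brick a contiguous slice of the 32-bit hash space."""
    step = (MAX_HASH + 1) // len(bricks)
    return [(i * step,
             MAX_HASH if i == len(bricks) - 1 else (i + 1) * step - 1,
             b)
            for i, b in enumerate(bricks)]

def hash_name(name):
    """Stand-in 32-bit hash of a file name (real DHT uses Davies-Meyer)."""
    return int.from_bytes(hashlib.md5(name.encode()).digest()[:4], 'big')

def place(name, layout):
    """Return the brick whose claimed range contains the name's hash."""
    h = hash_name(name)
    return next(b for start, end, b in layout if start <= h <= end)

old = make_layout(['X', 'Y'])        # X: 0..0x7fffffff, Y: the rest
new = make_layout(['X', 'Z', 'Y'])   # Z claims the middle third
files = ['file%d' % i for i in range(10)]
moved = [f for f in files if place(f, old) != place(f, new)]
print(moved)   # only names hashing into the reassigned slices move

Note how adding Z only reassigns the middle slices of X's and Y's old ranges, which is exactly the "minimize data motion" property from the bullet above.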

Page 21: Hands On Gluster with Jeff Darcy

Sharding

▸ Divides files into chunks
▸ Each chunk is placed separately according to hash
▸ High probability (not certainty) of chunks being on different subvolumes
▸ Spreads capacity and I/O across subvolumes (see the sketch below)
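A sketch of the naming scheme that makes this work, reusing the placement idea above: the shard translator keeps the first block at the file's original path and names later blocks by the file's GFID under an internal /.shard directory, so each chunk hashes, and is therefore placed, independently. Sizes and the partial GFID here are illustrative:

SHARD_SIZE = 64 * 2**20        # e.g. a 64MB shard-block-size

def shard_index(offset):
    """Which chunk of the file holds this byte offset."""
    return offset // SHARD_SIZE

def shard_path(path, gfid, index):
    # Block 0 stays at the original path; later blocks live in /.shard
    # and are named by GFID, so each hashes to its own location.
    return path if index == 0 else '/.shard/%s.%d' % (gfid, index)

# A 200MB file spans shards 0..3, each placed separately by hash:
print({shard_path('/bigfile', 'e1b0...', shard_index(off))
       for off in range(0, 200 * 2**20, SHARD_SIZE)})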

Page 22: Hands On Gluster with Jeff Darcy

Hands On: Adding a Brick

[root@vagrant-testVM glusterfs]# gluster volume create xyzzy testvm:/d/backends/xyzzy{0,1}
[root@vagrant-testVM glusterfs]# getfattr -d -e hex \
    -m trusted.glusterfs.dht /d/backends/xyzzy{0,1}
# file: d/backends/xyzzy0
trusted.glusterfs.dht=0x0000000100000000000000007ffffffe
# file: d/backends/xyzzy1
trusted.glusterfs.dht=0x00000001000000007fffffffffffffff
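The packed value can be decoded by hand: it is four big-endian 32-bit words, and the last two are the start and end of the brick's claimed hash range (the leading two words are layout metadata). A small Python sketch:

import struct

def dht_range(xattr_hex):
    """Pull the (start, end) hash range out of trusted.glusterfs.dht."""
    raw = bytes.fromhex(xattr_hex[2:])            # strip the leading 0x
    _meta1, _meta2, start, end = struct.unpack('>4I', raw)
    return hex(start), hex(end)

print(dht_range('0x0000000100000000000000007ffffffe'))
# ('0x0', '0x7ffffffe') -- xyzzy0 claims the lower half of the hash space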

Page 23: Hands On Gluster with Jeff Darcy

Hands On: Adding a Brick

[root@vagrant-testVM glusterfs]# gluster volume add-brick xyzzy \
    testvm:/d/backends/xyzzy2
volume add-brick: success
[root@vagrant-testVM glusterfs]# gluster volume rebalance xyzzy \
    fix-layout start
volume rebalance: xyzzy: success: Rebalance on xyzzy has been started successfully. Use rebalance status command to check status of the rebalance process.
ID: 88782248-7c12-4ba8-97f6-f5ce6815963

Page 24: Hands On Gluster with Jeff Darcy

Hands On: Adding a Brick

[root@vagrant-testVM glusterfs]# getfattr -d -e hex -m \
    trusted.glusterfs.dht /d/backends/xyzzy{0,1,2}
# file: d/backends/xyzzy0
trusted.glusterfs.dht=0x00000001000000000000000055555554
# file: d/backends/xyzzy1
trusted.glusterfs.dht=0x0000000100000000aaaaaaaaffffffff
# file: d/backends/xyzzy2
trusted.glusterfs.dht=0x000000010000000055555555aaaaaaa9

Page 25: Hands On Gluster with Jeff Darcy

Split Brain (problem definition)

▸ “Split brain” is when we don’t have enough information to determine correct recovery action

▸ Can be caused by node failure or network partition

▸ Every distributed data store has to prevent and/or deal with it

Page 26: Hands On Gluster with Jeff Darcy

How Replication Works

▸ Client sends operation (e.g. write) to all replicas directly

▸ Coordination: pre-op, post-op, locking
▹ enables recovery in case of failure (sketched below)

▸ Self-heal (repair) usually done by internal daemon
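A toy in-memory model of that flow follows. It is a sketch of the pre-op/post-op idea only, not AFR's actual changelog-xattr scheme (which tracks pending counts per peer and is more subtle): mark every replica dirty before writing, clear the mark where the write succeeded, and anything still marked after a failure is what self-heal repairs.

class Replica:
    def __init__(self, name):
        self.name, self.data, self.pending = name, {}, 0

    def write(self, key, value):
        self.data[key] = value
        return True

def replicated_write(replicas, key, value):
    # (locking elided) pre-op: record that a write is in flight everywhere
    for r in replicas:
        r.pending += 1
    done = [r for r in replicas if r.write(key, value)]
    # post-op: clear the mark on successful replicas; a replica that died
    # mid-write keeps pending > 0 and is flagged for self-heal
    for r in done:
        r.pending -= 1

a, b = Replica('fubar0'), Replica('fubar1')
replicated_write([a, b], 'best-sf', 'star trek')
print(a.pending, b.pending)   # 0 0 -> both copies clean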

Page 27: Hands On Gluster with Jeff Darcy

Split Brain (how it happens)

[Diagram: a network partition separates the cluster; Client X can reach only Server A and Client Y can reach only Server B, so each side keeps accepting writes independently]

Page 28: Hands On Gluster with Jeff Darcy

Split Brain (what it looks like)

[root@vagrant-testVM glusterfs]# ls /mnt/glusterfs/0
ls: cannot access /mnt/glusterfs/0/best-sf: Input/output error
best-sf
[root@vagrant-testVM glusterfs]# cat /mnt/glusterfs/0/best-sf
cat: /mnt/glusterfs/0/best-sf: Input/output error
[root@vagrant-testVM glusterfs]# cat /d/backends/fubar0/best-sf
star trek
[root@vagrant-testVM glusterfs]# cat /d/backends/fubar1/best-sf
star wars

What the...?

Page 29: Hands On Gluster with Jeff Darcy

Split Brain (dealing with it)

▸ Primary mechanism: quorum
▹ server side, client side, or both
▹ arbiters
▸ Secondary: rule-based resolution
▹ e.g. largest, latest timestamp
▹ Thanks, Facebook!
▸ Last choice: manual repair

Page 30: Hands On Gluster with Jeff Darcy

Server Side Quorum

[Diagram: bricks A, B, and C. Client X is on the side with a quorum of servers, so its writes succeed; the minority brick is forced down, leaving Client Y with no servers]

Page 31: Hands On Gluster with Jeff Darcy

Client Side Quorum

[Diagram: bricks A, B, and C, all staying up. Client X can reach a quorum of bricks, so its writes succeed; Client Y can reach only a minority, so its writes are rejected locally with EROFS. A sketch of the rule follows.]
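A minimal sketch of the client-side rule; illustrative only, since real AFR quorum (cluster.quorum-type, quorum-count, arbiter handling) is more configurable:

import errno, os

def check_write_quorum(reachable, total):
    """Allow writes only while a strict majority of bricks is reachable."""
    if 2 * reachable <= total:
        raise OSError(errno.EROFS, os.strerror(errno.EROFS))

check_write_quorum(2, 3)          # Client X: majority, write proceeds
try:
    check_write_quorum(1, 3)      # Client Y: minority
except OSError as e:
    print(e)                      # [Errno 30] Read-only file system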

Page 32: Hands On Gluster with Jeff Darcy

Erasure Coding

▸ Encode N input blocks into N+K output blocks, so that original can be recovered from any N.

▸ RAID is erasure coding with K=1 (RAID 5) or K=2 (RAID 6)

▸ Our implementation mostly has the same flow as replication (a K=1 sketch follows)
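Here is the K=1 case from the second bullet as runnable Python: XOR parity is exactly RAID 5's code, and Gluster's EC translator generalizes the same recover-from-any-N property to larger K. This is the idea, not the translator's actual encoding:

from functools import reduce

def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def encode(data_blocks):
    """N data blocks -> N+1 stored blocks; the extra one is XOR parity."""
    return data_blocks + [reduce(xor, data_blocks)]

def recover(stored, lost):
    """Rebuild any single missing block from the surviving N blocks."""
    return reduce(xor, (b for i, b in enumerate(stored) if i != lost))

stored = encode([b'AAAA', b'BBBB', b'CCCC'])   # N=3, K=1
print(recover(stored, 1))                      # b'BBBB', without block 1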

Page 33: Hands On Gluster with Jeff Darcy

Erasure Coding

[diagram-only slide]

Page 34: Hands On Gluster with Jeff Darcy

Erasure Coding

[diagram-only slide]

Page 35: Hands On Gluster with Jeff Darcy

BREAK

Page 36: Hands On Gluster with Jeff Darcy

Quota

▸ Gluster supports directory-level quota
▸ For nested directories, the lowest applicable limit applies
▸ Soft and hard limits
▹ Exceeding the soft limit gets logged
▹ Exceeding the hard limit gets EDQUOT

Page 37: Hands On Gluster with Jeff Darcy

Quota

▸ Problem: global vs. local limits
▹ quota is global (per volume)
▹ files are pseudo-randomly distributed across bricks
▸ How do we enforce this?
▸ Quota daemon exists to handle this coordination (see the sketch below)
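To tie the two quota slides together, a sketch of the enforcement rule: walk up from the file's directory, apply every limit seen (so the lowest nested limit wins), log on the soft limit, and return EDQUOT on the hard limit. This is illustrative bookkeeping, not the quota translator's actual accounting, which the quota daemon aggregates across bricks:

import errno, os

hard_limits = {'/john': 100 * 2**20, '/john/projects': 10 * 2**20}
SOFT_FRACTION = 0.8          # default soft limit is 80% of hard

def check_quota(path, used, nbytes):
    """Walk up from the file's directory and enforce every limit seen."""
    d = os.path.dirname(path)
    while True:
        limit = hard_limits.get(d)
        if limit is not None:
            if used.get(d, 0) + nbytes > limit:
                raise OSError(errno.EDQUOT, 'Disk quota exceeded', path)
            if used.get(d, 0) + nbytes > SOFT_FRACTION * limit:
                print('soft limit crossed on', d)   # the real code logs an alert
        if d == '/':
            break
        d = os.path.dirname(d)

try:
    check_quota('/john/projects/big', {}, 85 * 2**20)
except OSError as e:
    print(e)   # the nested 10MB limit wins even though /john allows 100MB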

Page 38: Hands On Gluster with Jeff Darcy

Hands On: Quota

[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy enable
volume quota : success
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy soft-timeout 0
volume quota : success
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy hard-timeout 0
volume quota : success
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy \
    limit-usage /john 100MB
volume quota : success

Page 39: Hands On Gluster with Jeff Darcy

Hands On: Quota

[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy list
Path    Hard-limit  Soft-limit   Used    Available  Soft-limit exceeded?  Hard-limit exceeded?
----------------------------------------------------------------------------------------------
/john   100.0MB     80%(80.0MB)  0Bytes  100.0MB    No                    No

Page 40: Hands On Gluster with Jeff Darcy

Hands On: Quota

[root@vagrant-testVM glusterfs]# dd if=/dev/zero \
    of=/mnt/glusterfs/0/john/bigfile bs=1048576 count=85 conv=sync
85+0 records in
85+0 records out
89128960 bytes (89 MB) copied, 1.83037 s, 48.7 MB/s
[root@vagrant-testVM glusterfs]# grep -i john /var/log/glusterfs/bricks/*
/var/log/glusterfs/bricks/d-backends-xyzzy0.log:[2016-11-29 14:31:44.581934] A [MSGID: 120004] [quota.c:4973:quota_log_usage] 0-xyzzy-quota: Usage crossed soft limit: 80.0MB used by /john

Page 41: Hands On Gluster with Jeff Darcy

Hands On: Quota

[root@vagrant-testVM glusterfs]# dd if=/dev/zero \
    of=/mnt/glusterfs/0/john/bigfile2 bs=1048576 count=85 conv=sync
dd: error writing '/mnt/glusterfs/0/john/bigfile2': Disk quota exceeded
[root@vagrant-testVM glusterfs]# gluster volume quota xyzzy list | cut -c 66-
Used     Available  Soft-limit exceeded?  Hard-limit exceeded?
--------------------------------------------------------------
101.9MB  0Bytes     Yes                   Yes

Page 42: Hands On Gluster with Jeff Darcy

Snapshots

▸ Gluster supports read-only snapshots and writable clones of snapshots
▸ Also, snapshot restores
▸ Support is based on / tied to LVM thin provisioning
▹ originally supposed to be more platform-agnostic
▹ maybe some day it really will be

Page 43: Hands On Gluster with Jeff Darcy

Hands On: Snapshots

[root@vagrant-testVM glusterfs]# fallocate -l $((100*1024*1024)) \
    /tmp/snap-brick0
[root@vagrant-testVM glusterfs]# losetup --show -f /tmp/snap-brick0
/dev/loop3
[root@vagrant-testVM glusterfs]# vgcreate snap-vg0 /dev/loop3
Volume group "snap-vg0" successfully created

Page 44: Hands On Gluster with Jeff Darcy

Hands On: Snapshots

[root@vagrant-testVM glusterfs]# lvcreate -L 50MB -T /dev/snap-vg0/thinpool
Rounding up size to full physical extent 52.00 MiB
Logical volume "thinpool" created.
[root@vagrant-testVM glusterfs]# lvcreate -V 200MB -T /dev/snap-vg0/thinpool -n snap-lv0
Logical volume "snap-lv0" created.
[root@vagrant-testVM glusterfs]# mkfs.xfs /dev/snap-vg0/snap-lv0
...
[root@vagrant-testVM glusterfs]# mount /dev/snap-vg0/snap-lv0 /d/backends/xyzzy0
...

Page 45: Hands On Gluster with Jeff Darcy

Hands On: Snapshots

[root@vagrant-testVM glusterfs]# gluster volume create xyzzy \
    testvm:/d/backends/xyzzy{0,1} force
[root@vagrant-testVM glusterfs]# echo hello > /mnt/glusterfs/0/file1
[root@vagrant-testVM glusterfs]# echo hello > /mnt/glusterfs/0/file2
[root@vagrant-testVM glusterfs]# gluster snapshot create snap1 xyzzy
snapshot create: success: Snap snap1_GMT-2016.11.29-14.57.11 created successfully
[root@vagrant-testVM glusterfs]# echo hello > /mnt/glusterfs/0/file3

Page 46: Hands On Gluster with Jeff Darcy

Hands On: Snapshots

[root@vagrant-testVM glusterfs]# gluster snapshot activate \
    snap1_GMT-2016.11.29-14.57.11
Snapshot activate: snap1_GMT-2016.11.29-14.57.11: Snap activated successfully
[root@vagrant-testVM glusterfs]# mount -t glusterfs \
    testvm:/snaps/snap1_GMT-2016.11.29-14.57.11/xyzzy /mnt/glusterfs/1
[root@vagrant-testVM glusterfs]# ls /mnt/glusterfs/1
file1  file2
[root@vagrant-testVM glusterfs]# echo hello > /mnt/glusterfs/1/file3
-bash: /mnt/glusterfs/1/file3: Read-only file system

Page 47: Hands On Gluster with Jeff Darcy

Hands On: Snapshots

[root@vagrant-testVM glusterfs]# gluster snapshot clone clone1 \
    snap1_GMT-2016.11.29-14.57.11
snapshot clone: success: Clone clone1 created successfully
[root@vagrant-testVM glusterfs]# gluster volume start clone1
volume start: clone1: success
[root@vagrant-testVM glusterfs]# mount -t glusterfs testvm:/clone1 \
    /mnt/glusterfs/2
[root@vagrant-testVM glusterfs]# echo goodbye > /mnt/glusterfs/2/file3

Page 48: Hands On Gluster with Jeff Darcy

Hands On: Snapshots

# Unmount and stop clone.
# Stop original volume - but leave snapshot activated!
[root@vagrant-testVM glusterfs]# gluster snapshot restore snap1_GMT-2016.11.29-14.57.11
Restore operation will replace the original volume with the snapshotted volume. Do you still want to continue? (y/n) y
Snapshot restore: snap1_GMT-2016.11.29-14.57.11: Snap restored successfully
[root@vagrant-testVM glusterfs]# gluster volume start xyzzy
volume start: xyzzy: success
[root@vagrant-testVM glusterfs]# ls /mnt/glusterfs/0
file1  file2

Page 49: Hands On Gluster with Jeff Darcy

BREAK

Page 50: Hands On Gluster with Jeff Darcy

Other Features

▸ Geo-replication
▸ Bitrot detection
▸ Transport security
▸ Encryption, compression/dedup etc. can be done locally on bricks

Page 51: Hands On Gluster with Jeff Darcy

Gluster 4.x

▸ GlusterD 2
▹ higher scale + interfaces + smarts
▸ Server-side replication
▸ DHT improvements for scale
▸ More multitenancy
▹ subvolume mounts, throttling/QoS

Page 52: Hands On Gluster with Jeff Darcy

Thank You!
http://[email protected]