Codemotion Rome 2015: GlusterFS


Page 1: Codemotion Rome 2015. GlusterFS

1

GlusterFS: a distributed file system for today and tomorrow's Big Data

Roberto “FRANK” Franchini

@robfrankie @CELI_NLP

CELI 2015

Roma, 27–28 March 2015

Page 2: Codemotion Rome 2015. GlusterFS

15 years of experience, proud to be a programmer

Writes software for information extraction, NLP, opinion mining (at scale), and a lot of other buzzwords

Implements scalable architectures

Plays with servers (don't say that to my sysadmin)

Member of the JUG-Torino coordination team

whoami(1)

CELI 2014 2

Page 3: Codemotion Rome 2015. GlusterFS

Company

CELI 2015 3

Page 4: Codemotion Rome 2015. GlusterFS

Identify a distributed and scalable file system for today and tomorrow's Big Data

CELI 2015 4

The problem

Page 5: Codemotion Rome 2015. GlusterFS

Once upon a time

CELI 2015 5

Page 6: Codemotion Rome 2015. GlusterFS

Once upon a time

CELI 2015 6

Page 7: Codemotion Rome 2015. GlusterFS

Once upon a time

2008: one NFS share

1.5 TB ought to be enough for anybody

Once upon a time

CELI 2015 7

Page 8: Codemotion Rome 2015. GlusterFS

Once upon a time

CELI 2015 8

Page 9: Codemotion Rome 2015. GlusterFS

Once upon a time

2010: Herd of shares

(1.5 TB × N) ought to be enough for anybody

Once upon a time

CELI 2015 9

Page 10: Codemotion Rome 2015. GlusterFS

Once upon a time

DIY hand-made distributed file system: HMDFS (tm)

Nobody could stop the data flood

It was time for something new

Once upon a time

CELI 2015 10

Page 11: Codemotion Rome 2015. GlusterFS

What we do with GFS

Can be enlarged on demand

No dedicated HW

OpenSource is preferred and trusted

No specialized API

No specialized Kernel

POSIX compliance

Zillions of big and small files

No NAS or SAN (€€€€€)

Requirements

CELI 2015 11

Page 12: Codemotion Rome 2015. GlusterFS

What we do with GFS

More than 10 GB of Lucene inverted indexes to be stored on a daily basis (more than 200 GB/month)

Search stored indexes to extract different sets of documents for different customers

Analytics on large sets of documents (years of data)

To do what?

CELI 2015 12

Page 13: Codemotion Rome 2015. GlusterFS

Features: GlusterFS

CELI 2015 13

Page 14: Codemotion Rome 2015. GlusterFS

Features

Clustered Scale-out General Purpose Storage Platform

POSIX-y Distributed File System

Built on commodity systems

x86_64 Linux ++

POSIX filesystems underneath (XFS, EXT4)

No central metadata Server (NO SPOF)

Modular architecture for scale and functionality

GlusterFS

CELI 2015 14

Page 15: Codemotion Rome 2015. GlusterFS

Features

Large Scale File Server

Media/Content Distribution Network (CDN)

Backup / Archive / Disaster Recovery (DR)

HDFS replacement in Hadoop ecosystem

High Performance Computing (HPC)

IaaS storage layer

Database offload (blobs)

Unified Object Store + File Access

Common use cases

CELI 2015 15

Page 16: Codemotion Rome 2015. GlusterFS

Features

ACL and Quota support

Fault-tolerance

Peer to peer

Self-healing

Fast setup

Enlarge/Shrink on demand

Snapshots (a quota/snapshot command sketch follows this slide)

On cloud or on premises (physical or virtual)

Features

CELI 2015 16
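As a hedged illustration of the quota and snapshot features listed above, the gluster CLI exposes them roughly as follows (testvol and the /projects path are placeholders; snapshots need a recent 3.x release with thinly provisioned LVM bricks):

# Enable quotas on a volume and cap a directory at 100 GB
sudo gluster volume quota testvol enable
sudo gluster volume quota testvol limit-usage /projects 100GB

# Take a point-in-time snapshot of the volume
sudo gluster snapshot create snap1 testvol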

Page 17: Codemotion Rome 2015. GlusterFS

Architecture

CELI 2015 17

Page 18: Codemotion Rome 2015. GlusterFS

Architecture

Peer / Node

Cluster servers (glusterfs server)

Runs the gluster daemons and participates in volumes

Node

CELI 2015 18
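A minimal sketch of how nodes join the trusted pool (node01/node02 are placeholder hostnames):

sudo gluster peer probe node02   # run from node01: add node02 to the trusted pool
sudo gluster peer status         # list the other peers and their connection state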

Page 19: Codemotion Rome 2015. GlusterFS

Architecture

Brick

A filesystem mountpoint on GlusterFS nodes

A unit of storage used as a capacity building block

Brick

CELI 2015 19

Page 20: Codemotion Rome 2015. GlusterFS

Architecture: from disks to bricks

CELI 2015 20

[Diagram: each Gluster server aggregates its 3 TB disks into LVM volumes]

Page 21: Codemotion Rome 2015. GlusterFS

Architecture: from disks to bricks

CELI 2015 21

[Diagram: each Gluster server combines its 3 TB disks into a RAID 6 physical volume exposed as a single brick]

Page 22: Codemotion Rome 2015. GlusterFS

Architecture: from disks to bricks

CELI 2015 22

[Diagram: each Gluster server exposes multiple bricks (/b1, /b2, … /bN) built from its 3 TB disks]

Page 23: Codemotion Rome 2015. GlusterFS

Bricks on a node

CELI 2015 23

Page 24: Codemotion Rome 2015. GlusterFS

Architecture

Logic between bricks or subvolumes that generates a subvolume with certain characteristics

Distribute, replicate and stripe are special translators that generate RAID-like configurations

Performance translators: read-ahead, write-behind

Translators

CELI 2015 24
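The translator stack is generated by glusterd and written into volfiles; a hedged way to peek at it on a server (testvol is a placeholder, and the exact file names vary between releases):

# List the generated volfiles for a volume called testvol
sudo ls /var/lib/glusterd/vols/testvol/
# The *fuse.vol file shows the client-side graph: distribute/replicate plus
# performance translators such as read-ahead and write-behind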

Page 25: Codemotion Rome 2015. GlusterFS

Architecture

Bricks combined and passed through translators

Ultimately, what's presented to the end user

Volume

CELI 2015 25

Page 26: Codemotion Rome 2015. GlusterFS

Architecture

brick → translator → subvolume → translator → subvolume → translator → subvolume → translator → volume

Volume

CELI 2015 26
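To see how bricks and translators end up presented as a single volume, gluster volume info summarises the layout; a sketch of the kind of output to expect for a 2x2 distributed-replicated volume (names are illustrative):

sudo gluster volume info testvol
# Volume Name: testvol
# Type: Distributed-Replicate
# Number of Bricks: 2 x 2 = 4
# Brick1: node01:/srv/sdb1/brick1  ...  Brick4: node02:/srv/sdb1/brick2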

Page 27: Codemotion Rome 2015. GlusterFS

Volume

CELI 2015 27

[Diagram: a client mounts volume /bigdata; the distribute translator spreads files over replicate sets of bricks /brick11 … /brick34 on Gluster Server 1, 2 and 3]

Page 28: Codemotion Rome 2015. GlusterFS

Volume

CELI 2015 28

Page 29: Codemotion Rome 2015. GlusterFS

Volume types

CELI 2015 29

Page 30: Codemotion Rome 2015. GlusterFS

Distributed

The default configuration

Files “evenly” spread across bricks

Similar to file-level RAID 0

Server/Disk failure could be catastrophic

Distributed

CELI 2014 30
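A hedged example of creating a purely distributed volume: no replica keyword, so files are only spread across the bricks, never copied (names are placeholders):

sudo gluster volume create distvol node01:/srv/sdb1/brick node02:/srv/sdb1/brick
sudo gluster volume start distvol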

Page 31: Codemotion Rome 2015. GlusterFS

Distributed

CELI 2015 31

Page 32: Codemotion Rome 2015. GlusterFS

Replicated

Files written synchronously to replica peers

Files read synchronously, but ultimately serviced by the first responder

Similar to file-level RAID 1

Replicated

CELI 2015 32
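A minimal sketch of a replica-2 volume, where each brick holds a full copy of every file (names are placeholders):

sudo gluster volume create repvol replica 2 node01:/srv/sdb1/brick node02:/srv/sdb1/brick
sudo gluster volume start repvol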

Page 33: Codemotion Rome 2015. GlusterFS

Replicated

CELI 2015 33

Page 34: Codemotion Rome 2015. GlusterFS

Distributed + replicated

Similar to file-level RAID 10

Most used layout

Distributed + replicated

CELI 2015 34
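A hedged sketch of the RAID 10-like layout: with replica 2 and four bricks, consecutive bricks form replica pairs and files are distributed across the pairs (names are placeholders):

sudo gluster volume create drvol replica 2 \
  node01:/srv/sdb1/brick1 node02:/srv/sdb1/brick1 \
  node01:/srv/sdb1/brick2 node02:/srv/sdb1/brick2
sudo gluster volume start drvol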

Page 35: Codemotion Rome 2015. GlusterFS

Distributed + replicated

CELI 2015 35

Page 36: Codemotion Rome 2015. GlusterFS

Striped

Individual files split among bricks (sparse files); similar to block-level RAID 0

Limited use cases: HPC pre/post-processing, file size exceeds brick size

Striped

CELI 2015 36
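For completeness, a hedged sketch of a striped volume as the 3.x CLI accepted it (the stripe translator was later deprecated; names are placeholders):

sudo gluster volume create stripevol stripe 2 node01:/srv/sdb1/brick node02:/srv/sdb1/brick
sudo gluster volume start stripevol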

Page 37: Codemotion Rome 2015. GlusterFS

Striped

CELI 2015 37

Page 38: Codemotion Rome 2015. GlusterFS

Moving parts

CELI 2015 38

Page 39: Codemotion Rome 2015. GlusterFS

Components

glusterd: management daemon. One instance on each GlusterFS server, interfaced through the gluster CLI.

glusterfsd: GlusterFS brick daemon. One process for each brick on each server, managed by glusterd.

Components

CELI 2015 39
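A quick, hedged way to check that the daemons described above are running (service names vary by distribution and release; glusterfs-server matches the Debian/Ubuntu packaging used in the quick start later on):

sudo service glusterfs-server status   # glusterd, the management daemon
ps aux | grep '[g]lusterfsd'           # one brick daemon per brick on this server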

Page 40: Codemotion Rome 2015. GlusterFS

Components

glusterfs: volume service daemon. One process for each volume service (NFS server, FUSE client, self-heal, quota, ...).

mount.glusterfs: FUSE native client mount extension.

gluster: the Gluster Console Manager (CLI).

Components

CELI 2015 40

Page 41: Codemotion Rome 2015. GlusterFS

Components

sudo fdisk /dev/sdb
sudo mkfs.xfs -i size=512 /dev/sdb1
sudo mkdir -p /srv/sdb1
sudo mount /dev/sdb1 /srv/sdb1
sudo mkdir -p /srv/sdb1/brick
echo "/dev/sdb1 /srv/sdb1 xfs defaults 0 0" | sudo tee -a /etc/fstab

sudo apt-get install glusterfs-server

Quick start

CELI 2015 41

Page 42: Codemotion Rome 2015. GlusterFS

Components

sudo gluster peer probe node02

sudo gluster volume create testvol replica 2 node01:/srv/sdb1/brick node02:/srv/sdb1/brick

sudo gluster volume start testvol

sudo mkdir /mnt/gluster
sudo mount -t glusterfs node01:/testvol /mnt/gluster

Quick start

CELI 2015 42
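A few hedged sanity checks after the quick start above (testvol and /mnt/gluster come from the previous commands):

sudo gluster peer status           # node02 should show as connected
sudo gluster volume info testvol   # type, bricks and status of the new volume
df -h /mnt/gluster                 # the mounted volume with its aggregate size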

Page 43: Codemotion Rome 2015. GlusterFS

Clients

CELI 2015 43

Page 44: Codemotion Rome 2015. GlusterFS

Clients: native

FUSE kernel module allows the filesystem to be built and operated entirely in userspace

Specify mount to any GlusterFS server

Native client fetches the volfile from the mount server, then communicates directly with all nodes to access data

Recommended for high concurrency and high write performance

Load is inherently balanced across distributed volumes

Clients: native

CELI 2015 44
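A hedged sketch of a native FUSE mount that can fall back if the mount server is down, plus a matching fstab entry (node names are placeholders and the backup-server option spelling differs slightly across versions):

sudo mount -t glusterfs -o backupvolfile-server=node02 node01:/testvol /mnt/gluster
echo "node01:/testvol /mnt/gluster glusterfs defaults,_netdev 0 0" | sudo tee -a /etc/fstab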

Page 45: Codemotion Rome 2015. GlusterFS

Clients: NFS

Standard NFS v3 clients

Standard automounter is supported

Mount to any server, or use a load balancer

GlusterFS NFS server includes Network Lock Manager (NLM) to synchronize locks across clients

Better performance for reading many small files from a single client

Load balancing must be managed externally

Clients: NFS

CELI 2015 45
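A hedged example of mounting the same volume through the built-in Gluster NFS server, which speaks NFSv3 over TCP (host and paths are placeholders):

sudo mkdir -p /mnt/gluster-nfs
sudo mount -t nfs -o vers=3,mountproto=tcp node01:/testvol /mnt/gluster-nfs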

Page 46: Codemotion Rome 2015. GlusterFS

Clients: libgfapi

Introduced with GlusterFS 3.4

User-space library for accessing data in GlusterFS

Filesystem-like API

Runs in application process

no FUSE, no copies, no context switches

...but same volfiles, translators, etc.

Clients: libgfapi

CELI 2015 46
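One shell-level illustration of libgfapi in practice: QEMU links against it and can create and use disk images directly on a volume, with no FUSE mount in between (host, volume and image name are placeholders):

qemu-img create gluster://node01/testvol/vm01.img 10G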

Page 47: Codemotion Rome 2015. GlusterFS

Clients: SMB/CIFS

In GlusterFS 3.4 – Samba + libgfapi

No need for local native client mount & re-export

Significant performance improvements with FUSE removed from the equation

Must be set up on each server you wish to connect to via CIFS

CTDB is required for Samba clustering

Clients: SMB/CIFS

CELI 2014 47
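A hedged sketch of exporting a volume through Samba's vfs_glusterfs module (share name, volume and server are placeholders; the CTDB clustering setup mentioned above is not shown):

cat <<'EOF' | sudo tee -a /etc/samba/smb.conf
[bigdata]
    path = /
    read only = no
    vfs objects = glusterfs
    glusterfs:volume = testvol
    glusterfs:volfile_server = localhost
EOF
sudo service smbd restart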

Page 48: Codemotion Rome 2015. GlusterFS

Clients: HDFS

Access data within and outside of Hadoop

No HDFS name node single point of failure / bottleneck

Seamless replacement for HDFS

Scales with the massive growth of big data

Clients: HDFS

CELI 2015 48

Page 49: Codemotion Rome 2015. GlusterFS

Scalability

CELI 2015 49

Page 50: Codemotion Rome 2015. GlusterFS

Under the hood

Elastic Hash Algorithm

No central metadata

No Performance Bottleneck

Eliminates risk scenarios

Location hashed intelligently on filename

Unique identifiers (GFID), similar to md5sum

Under the hood

CELI 2015 50
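A hedged way to look at the GFID from a server, read-only and on a brick path (the file path is a placeholder; as the later "Do not" slide says, never write to bricks directly):

# On a Gluster server, inspect the extended attributes of a file inside a brick
sudo getfattr -d -m . -e hex /srv/sdb1/brick/path/to/file
# trusted.gfid holds the file's unique identifier; trusted.glusterfs.dht on
# directories holds the elastic-hash ranges used to place files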

Page 51: Codemotion Rome 2015. GlusterFS

Scalability

[Diagram: adding 3 TB bricks to a server scales out capacity; adding Gluster servers scales out performance and availability]

Scalability

CELI 2014 51

Page 52: Codemotion Rome 2015. GlusterFS

Scalability

Add disks to servers to increase storage size

Add servers to increase bandwidth and storage size

Add servers to increase availability (replica factor)

Scalability

CELI 2015 52
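A hedged sketch of growing a replica-2 volume by one replica pair and then spreading existing data onto it (node and brick names are placeholders):

sudo gluster peer probe node03
sudo gluster peer probe node04
sudo gluster volume add-brick testvol replica 2 node03:/srv/sdb1/brick node04:/srv/sdb1/brick
sudo gluster volume rebalance testvol start
sudo gluster volume rebalance testvol status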

Page 53: Codemotion Rome 2015. GlusterFS

What we do with GlusterFS

CELI 2015 53

Page 54: Codemotion Rome 2015. GlusterFS

What we do with GFS

Daily production of more than 10 GB of Lucene inverted indexes stored on GlusterFS (more than 200 GB/month)

Search stored indexes to extract different sets of documents for every customer

Analytics over large sets of documents

YES: we open indexes directly on storage (it's POSIX!!!)

What we do with GlusterFS

CELI 2015 54

Page 55: Codemotion Rome 2015. GlusterFS

What we do with GFS

Zillions of very small files: this kind of FS is not designed to support this use case

What we can't do

CELI 2015 55

Page 56: Codemotion Rome 2015. GlusterFS

2010: first installation

Version 3.0.x

8 (not dedicated) servers

Distributed replicated

No bound on brick size (!!!!)

Circa 4 TB available

NOTE: stuck to 3.0.x until 2012 due to problems in the 3.1 and 3.2 series; then RH acquired Gluster (RH Storage)

2010: first installation

CELI 2015 56

Page 57: Codemotion Rome 2015. GlusterFS

2012: (little) cluster

New installation, version 3.3.2

4TB available on 8 servers (DELL c5000)

still not dedicated

1 brick per server limited to 1TB

2 TB RAID 1 on each server

Still in production

2012: a new (little) cluster

CELI 2015 57

Page 58: Codemotion Rome 2015. GlusterFS

2012: enlarge

New installation, upgrade to 3.3.x

6TB available on 12 servers (still not dedicated)

Enlarged to 9TB on 18 servers

Brick sizes bounded AND unbounded

2012: enlarge

CELI 2015 58

Page 59: Codemotion Rome 2015. GlusterFS

2013: fail

18 non-dedicated servers: too many

18 bricks of different sizes

2 big outages due to bricks running out of space

2013: fail

CELI 2014 59

Page 60: Codemotion Rome 2015. GlusterFS

2014: BIG fail

Didn’t restart after a move

but…

All data were recovered

(files are scattered on bricks, read from them!)

2014: BIG fail

CELI 2014 60

Page 61: Codemotion Rome 2015. GlusterFS

2014: consolidate

2 dedicated servers (DELL 720xd)

12 x 3TB SAS raid6

4 bricks per server

28 TB available

distributed replicated

4x1Gb bonded NIC

circa 40 FUSE clients (other servers)

2014: consolidate

CELI 2014 61

Page 62: Codemotion Rome 2015. GlusterFS

Consolidate

[Diagram: Gluster Server 1 and Gluster Server 2, each with bricks 1–4]

Consolidate

CELI 2015 62

Page 63: Codemotion Rome 2015. GlusterFS

Consolidate: one-year trend

CELI 2015 63

Page 64: Codemotion Rome 2015. GlusterFS

2015: scale up

Buy a new server

Put it in the rack

Reconfigure bricks

Wait until Gluster replicates the data

Maybe rebalance the cluster

2015: scale up

CELI 2014 64
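For the "wait until Gluster replicates the data" step, a hedged way to watch self-heal and brick status once the new bricks are in place (testvol is a placeholder):

sudo gluster volume heal testvol info   # files still pending heal, per brick
sudo gluster volume status testvol      # brick processes and their ports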

Page 65: Codemotion Rome 2015. GlusterFS

Scale up

[Diagram: after scale-up, bricks 11–34 are redistributed across Gluster Server 1, 2 and the new Server 3]

Scale up

CELI 2015 65

Page 66: Codemotion Rome 2015. GlusterFS

Do

Dedicated servers (physical or virtual)

RAID 6 or RAID 10 (with small files)

Multiple bricks of same size

Plan to scale

Do

CELI 2015 66

Page 67: Codemotion Rome 2015. GlusterFS

Do not

Multi-purpose servers

Bricks of different sizes

Very small files (<10 KB)

Writing directly to bricks

Do not

CELI 2015 67

Page 68: Codemotion Rome 2015. GlusterFS

Some raw tests

read: total transferred file size: 23.10G bytes, 43.46M bytes/sec

write: total transferred file size: 23.10G bytes, 38.53M bytes/sec

Some raw tests

CELI 2015 68

Page 69: Codemotion Rome 2015. GlusterFS

Some raw tests: transfer rate

CELI 2015 69

Page 70: Codemotion Rome 2015. GlusterFS

Raw tests

NOTE: run in production under heavy load, not in a clean test environment

Raw tests

CELI 2015 70

Page 71: Codemotion Rome 2015. GlusterFS

Resources

http://www.gluster.org/

https://access.redhat.com/documentation/en-US/Red_Hat_Storage/

https://github.com/gluster

http://www.redhat.com/products/storage-server/

http://joejulian.name/blog/category/glusterfs/

http://jread.us/2013/06/one-petabyte-red-hat-storage-and-glusterfs-project-overview/

Resources

CELI 2015 71

Page 72: Codemotion Rome 2015. GlusterFS

CELI 2014 72

[email protected]

www.celi.it

Torino | Milano | Trento

Thank You

@robfrankie