Highly-Available Lustre with SRP-Mirrored LUNs

UF HPC Center, Research Computing, University of Florida

TRANSCRIPT

Page 1: Highly-Available Lustre with SRP-Mirrored LUNs

Highly-Available Lustre with SRP-Mirrored LUNs

UF HPC Center, Research Computing, University of Florida

Page 2: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: Design Goals

- Minimize cost per TB
- Maximize availability
- Good performance (within cost constraints)
- Avoid external SAS/Fibre-attached JBODs
- Avoid external RAID controllers
- Support Ethernet and InfiniBand clients
- Standard components
- Open-source software

Page 3: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: To Minimize Cost

- Commodity storage chassis
- Internal PCIe RAID controllers
- Inexpensive, high-capacity 7200 rpm drives

Problem: How do we enable failover?

Solution: InfiniBand + SRP (SCSI RDMA Protocol)

Page 4: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: The Problem

- All storage is internal to each chassis.
- There is no way for one server to take over the other server's storage in the event of a server failure.
- Without dual-ported storage and external RAID controllers, how can one server take over the other's storage?

Solution: InfiniBand SCSI RDMA Protocol (SRP)

Page 5: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: InfiniBand

InfiniBand
- Low-latency, high-bandwidth interconnect
- Used natively for distributed-memory applications (MPI)
- Encapsulation layer for other protocols (IP, SCSI, FC, etc.)

SCSI RDMA Protocol (SRP)
- Think of it as SCSI over IB
- Provides a host with block-level access to storage devices in another host
- Via SRP, host A can see host B's drives and vice versa (sketch below)
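
Concretely, an SRP login makes a remote host's LUNs show up as ordinary local SCSI disks. A minimal initiator-side sketch, assuming the stock Linux ib_srp initiator; the target identifiers below are placeholders, not values from this deployment:

    # List reachable SRP targets as ready-made login strings.
    ibsrpdm -c
    # Log in to one target; its LUNs then appear as local /dev/sd* devices.
    # The identifiers here are hypothetical placeholders.
    echo "id_ext=...,ioc_guid=...,dgid=...,pkey=ffff,service_id=..." \
        > /sys/class/infiniband_srp/srp-mlx4_0-1/add_target
    # Verify: the remote LUNs now show up alongside local disks.
    lsscsi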

Page 6: Highly-Available Lustre with SRP-Mirrored LUNs

HA Storage

Host A can see host B's storage, and host B can see host A's storage, but there's a catch…

If host A fails completely, host B still won't be able to access host A's storage, since host A will be down and all the storage is internal.

So SRP/IB doesn't solve the whole problem.

But… what if host B had a local copy of host A's storage, and vice versa? (Pictures coming; stay tuned.)

Think of a RAID-1 mirror where the mirrored volume comprises one local drive and one remote (via SRP) drive.

Page 7: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: Mirrored (RAID-1) Volumes

- Two (or more) drives
- Data is kept consistent across both/all drives
- Writes are duplicated to each disk
- Reads can take place from either/any disk

Page 8: Highly-Available Lustre with SRP-Mirrored LUNs

Remote Mirrors: Not Possible?

Page 9: Highly-Available Lustre with SRP-Mirrored LUNs

Remote Mirrors: Remote targets exposed via SRP

Page 10: Highly-Available Lustre with SRP-Mirrored LUNs

Remote Mirrors: Mirroring possibilities

Page 11: Highly-Available Lustre with SRP-Mirrored LUNs

Remote Mirrors: Normal operating conditions

Page 12: Highly-Available Lustre with SRP-Mirrored LUNs

Remote Mirrors: Host A is down

Page 13: Highly-Available Lustre with SRP-Mirrored LUNs

Remote Mirrors: Degraded mirrors on host B

Page 14: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: Hardware Configuration

- Chenbro RM91250 chassis (50 drives, 9U)
- SuperMicro X8DAH system board
  - PCIe slots: 2 x16, 4 x8, 1 x4
- Intel E5620 processors (2)
- 24 GB RAM
- Adaptec 51245 PCIe RAID controller (4) (x8 slots)
- Mellanox MT26428 ConnectX QDR IB HCA (2) (x16 slots)
- Mellanox MT25204 InfiniHost III SDR IB HCA (1) (x4 slot)

Page 15: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: RAID Configuration

- Adaptec 51245 (4)
- RAID-6 (4+2) (to stay below an 8 TB LUN)
- 7.6 TiB per LUN
- 2 LUNs per controller
- 8 LUNs per OSS
- 60.8 TiB per OSS

Page 16: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: LVM2 Configuration

Encapsulate each LUN in an LV, for:
- Identification
- Convenience

LVs are named by host, controller, and LUN: h<L>c<M>v<N>

h1c1v0, h1c1v1, h1c2v0, h1c2v1, h1c3v0, h1c3v1, h1c4v0, h1c4v1
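
A minimal sketch of that encapsulation for one LUN, assuming the first LUN on controller 1 of host 1 appears as /dev/sdb; the device name and the alignment value are assumptions (the deck's alignment tuning comes up on a later slide):

    # Create an aligned PV, then a VG and a single LV spanning the LUN.
    # VG and LV share the h<L>c<M>v<N> name, giving /dev/h1c1v0/h1c1v0.
    pvcreate --dataalignment 1024k /dev/sdb
    vgcreate h1c1v0 /dev/sdb
    lvcreate --name h1c1v0 --extents 100%FREE h1c1v0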

Page 17: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: MD (Mirror) Configuration

Each mirror consists of one local and one remote LUN (creation sketch below).

Host 1
  /dev/<vg>/<lv>:
    /dev/h1c1v0/h1c1v0 (local)
    /dev/h2c1v0/h2c1v0 (remote)
  Device: /dev/md/ost0000

Host 2
  /dev/<vg>/<lv>:
    /dev/h1c1v1/h1c1v1 (remote)
    /dev/h2c1v1/h2c1v1 (local)
  Device: /dev/md/ost0004
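
A sketch of building one such mirror on host 1, assuming the remote LV is already visible via SRP. The write-intent bitmap is an assumption, not something the slides mention; it would shorten resyncs after a failover:

    # RAID-1 from one local LV and one remote (SRP-backed) LV.
    mdadm --create /dev/md/ost0000 --level=1 --raid-devices=2 \
          --bitmap=internal \
          /dev/h1c1v0/h1c1v0 /dev/h2c1v0/h2c1v0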

Page 18: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: Host 1 / Host 2 Layout

Host 1
  MD mirrors (local LV + remote LV):
    md100 = h1c1v0 + h2c1v0
    md101 = h1c2v0 + h2c2v0
    md102 = h1c3v0 + h2c3v0
    md103 = h1c4v0 + h2c4v0
  OSTs:
    ost0000 = md100
    ost0001 = md101
    ost0002 = md102
    ost0003 = md103

Host 2
  MD mirrors (local LV + remote LV):
    md104 = h1c1v1 + h2c1v1
    md105 = h1c2v1 + h2c2v1
    md106 = h1c3v1 + h2c3v1
    md107 = h1c4v1 + h2c4v1
  OSTs:
    ost0004 = md104
    ost0005 = md105
    ost0006 = md106
    ost0007 = md107

Page 19: Highly-Available Lustre with SRP-Mirrored LUNs

[Diagram]

Page 20: Highly-Available Lustre with SRP-Mirrored LUNs

[Diagram]

Page 21: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: High-Availability Software (Open Source)

Corosync + Pacemaker

Corosync
- Membership
- Messaging

Pacemaker
- Resource monitoring and management framework
- Extensible via resource agent templates
- Policy engine

Page 22: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: Corosync Configuration

Dual rings:
- Back-to-back Ethernet
- IPoIB on the SRP IB interface

Settings (see the corosync.conf sketch below):
- clear_node_high_bit: yes
- rrp_mode: passive
- rrp_problem_count_threshold: 20
- retransmits_before_loss: 6
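
In corosync.conf terms, the above maps onto a totem section along these lines. A sketch only: the subnet and multicast addresses are hypothetical, and the slide's "retransmits_before_loss" is written out as the full totem option name:

    totem {
        version: 2
        clear_node_high_bit: yes
        rrp_mode: passive
        rrp_problem_count_threshold: 20
        token_retransmits_before_loss_const: 6
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.10.0   # back-to-back Ethernet link
            mcastaddr: 226.94.1.1
            mcastport: 5405
        }
        interface {
            ringnumber: 1
            bindnetaddr: 192.168.20.0   # IPoIB on the SRP IB fabric
            mcastaddr: 226.94.1.2
            mcastport: 5405
        }
    }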

Page 23: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: Pacemaker Configuration

Resources:
- STONITH (modified to control multiple smart PDUs)
- MD (custom)
- Filesystem (stock)

Resource groups (managed together; crm sketch below):
- One per OST (grp_ostNNNN): MD + Filesystem
- LVs are not managed as resources: some disappear if a node goes down
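
A sketch of one OST's group in crm shell syntax. The MD resource agent is the authors' own custom agent, so the "ocf:local:md" name and its parameter are assumptions, as is the mount point; Filesystem is the stock OCF agent:

    primitive p_md_ost0000 ocf:local:md \
        params md_device="/dev/md/ost0000"
    primitive p_fs_ost0000 ocf:heartbeat:Filesystem \
        params device="/dev/md/ost0000" directory="/lustre/ost0000" fstype="lustre"
    # Group = ordered, colocated unit: assemble the mirror, then mount the OST.
    group grp_ost0000 p_md_ost0000 p_fs_ost0000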

Page 24: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: Performance

- 4 PCIe RAID controllers per server
- 2 RAID-6 (4+2) logical disks per controller
- 8 logical disks per server (4 local, 4 remote)
- 490 MB/s per logical disk
- 650 MB/s per controller (parity limited)

Three IB interfaces per server:
- IB clients (QDR, dedicated)
- IPoIB clients (SDR, dedicated)
- SRP mirror traffic (QDR, dedicated)

Page 25: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: Performance (continued)

Per-server throughput:
- 1.1 GB/s per server (writes, as seen by clients)
- 1.7 GB/s per server (reads, as seen by clients)

Actual server throughput is 2x for writing (mirrors!):
- That's 2.2 GB/s per server
- 85% of the 2.6 GB/s for the raw storage

Page 26: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: Performance – Didn't come easy

Defaults for everything, no mirroring:
- Default PV alignment (??)
- RAID stripe unit size (256 KB)
- aacraid max_hw_sectors_kb (256 KB, controlled by acbsize)
- MD device max_sectors_kb (128 KB)
- Lustre max RPC size (1024 KB)

Per-OST streaming throughput, no mirroring. Ugh!
- Reads: ~253 MB/s
- Writes: ~173 MB/s
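
The slides don't say how these per-OST numbers were taken; one generic way to measure streaming throughput of a backing device is direct I/O with dd, bypassing the page cache (device name assumed, and the write test is destructive, so pre-production only):

    # Streaming read of the MD (or LUN) device.
    dd if=/dev/md100 of=/dev/null bs=1M count=4096 iflag=direct
    # Streaming write -- DESTRUCTIVE, only on a device with no data.
    dd if=/dev/zero of=/dev/md100 bs=1M count=4096 oflag=direct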

Page 27: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: Performance – Didn't come easy

Align PVs to the RAID stripe boundary:
- Streaming reads: ~333 MB/s
- Streaming writes: ~280 MB/s

Increase MD max I/O = RAID stripe size = aacraid max I/O:
- Required a patch to the MD RAID-1 module (hardwired)
- Only improved streaming reads: ~360 MB/s

Increase max I/O size (MD + aacraid) to 512 KB, aacraid acbsize=4096 (driver unstable beyond 4096):
- Streaming writes: ~305 MB/s

Could not reach a 1 MB max I/O size.
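
A sketch of where those knobs live; device names are placeholders, and note that the MD-side limit required the RAID-1 module patch mentioned above, not just a sysfs write:

    # aacraid: raise the controller's maximum I/O size via the acbsize
    # module parameter (the driver was unstable beyond acbsize=4096 here).
    echo "options aacraid acbsize=4096" > /etc/modprobe.d/aacraid.conf
    # Block layer: allow 512 KB requests through member disks and the MD device.
    echo 512 > /sys/block/sdb/queue/max_sectors_kb
    echo 512 > /sys/block/md100/queue/max_sectors_kb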

Page 28: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: Performance – Didn't come easy

Introduce SRP mirrors…
- Lustre RPC size = aacraid max I/O = SRP target RDMA size = MD max I/O = 512 KB
- Per-OST streaming reads: ~433 MB/s (improvement via MD read balancing)
- Per-OST streaming writes: ~280 MB/s
  - Slight penalty with SRP: can be CPU-bound on the core that handles the SRP HCA interrupts
  - A slightly faster OSS CPU would presumably help this

Page 29: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: Performance – Summary

HA OSS pair (4 SRP-mirrored OSTs each):
- Streaming writes: 1.1 GB/s per OSS as seen by clients (i.e., 2.2 GB/s actually written, counting mirror traffic), 85% of the sgpdd-survey result
- Streaming reads: 3.4 GB/s per pair, 1.7 GB/s observed from each HA OSS

Considerable improvement over the defaults.

Page 30: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: Keeping the Data Safe

- Mirrors enable failover and provide a second copy of the data
- Each mirror half is hardware RAID: RAID-6 (4+2), two copies of parity data
- Servers protected by UPS: orderly shutdown of servers in the event of a sudden power outage
- 3+1 redundant power supplies, each to a different UPS

Page 31: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: Problems Encountered

Unstable SRP target:
- The OFED SRP target proved unstable
- Switched to the SCST SRP target (starting with a pre-2.0 release)

MD mirror assembly:
- May choose the wrong mirror half under Corosync; could not duplicate this outside of Corosync control
- Recovery requires deactivating the out-of-sync volume, assembling the degraded mirror, then re-adding the out-of-sync volume; not ideal (a sketch follows)

Poor initial performance:
- Resolved through tuning (described previously)
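
A sketch of that manual recovery on host 1, using the device names from earlier slides; the exact sequence used in production is not in the deck:

    # Deactivate the out-of-sync (remote) half so MD cannot pick it.
    lvchange -an h2c1v0/h2c1v0
    # Assemble and start the mirror degraded, from the good half only.
    mdadm --assemble --run /dev/md/ost0000 /dev/h1c1v0/h1c1v0
    # Reactivate the stale half and re-add it; MD resyncs it from the good half.
    lvchange -ay h2c1v0/h2c1v0
    mdadm /dev/md/ost0000 --add /dev/h2c1v0/h2c1v0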

Page 32: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: Problems Encountered (continued)

The zone allocator killed us:
- Blocked monitoring agents led to many needless remounts and sometimes STONITH events
- Could not pinpoint the problem, which often (but not always) seemed correlated with load
- It seems we were the last to know about the long delays caused by the zone allocator
- Many timeout parameters were unnecessarily adjusted to be very loooong
- vm.zone_reclaim_mode = 0
- 100% stable now
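
The fix itself is a one-line sysctl; persisting it across reboots is the usual sysctl.conf entry:

    # Disable NUMA zone reclaim, which was stalling I/O and monitoring agents.
    sysctl -w vm.zone_reclaim_mode=0
    # Persist across reboots.
    echo "vm.zone_reclaim_mode = 0" >> /etc/sysctl.conf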

Page 33: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre: Future Improvements

- SSD cache (e.g., Adaptec maxCache)
- External journal device
- 6 Gbps RAID cards capable of > 512 KB I/Os
- Faster processor (for SRP interrupt handling)
- 8+2 RAID-6 OSTs
  - More efficient disk utilization (4/5 vs. 2/3)
  - Affects chassis and backplane choices

Page 34: Highly-Available Lustre with SRP-Mirrored LUNs

HA Lustre

Thank You
Questions or Comments?