
ddn.com ©2012 DataDirect Networks. All Rights Reserved.

SFA12KX and Lustre Update

Sep 2014

Maria Perez Gutierrez, HPC Specialist
HPC Advisory Council


Agenda

▶  SFA12KX – Features update
 •  Partial rebuilds
 •  QoS on reads

▶  Lustre metadata performance update


SFA12KX Features Update


Big Data & Cloud Infrastructure: DDN FY14 Focus Areas

Analytics Reference Architectures
 •  EXAScaler™ – 10Ks of clients, 1TB/s+, HSM; Linux HPC clients, NFS; petascale Lustre® storage
 •  GRIDScaler™ – ~10K clients, 1TB/s+, HSM; Linux/Windows HPC clients, NFS & CIFS; enterprise scale-out file storage

Storage Fusion Architecture™ Core Storage Platforms
 •  SFA12KX/E – 40GB/s, 1.7M IOPS, 1,680 drives supported, embedded computing
 •  SFA7700 – 13GB/s, 600K IOPS (7700x, 7700E*)
 •  Flexible drive configuration – SAS, SATA, SSD
 •  SFX automated flash caching – Read, Context Commit, Instant Commit

Cloud Foundation
 •  WOS® 3.0 – 32 trillion unique objects, geo-replicated cloud storage, 256 million objects/second, self-healing cloud, parallel Boolean search
 •  WOS7000 – 60 drives in 4U, self-contained servers, S3/Swift

Big Data Platform Management
 •  DirectMon
 •  Cloud tiering
 •  Infinite Memory Engine™ – distributed file system buffer cache* [Demo]
 •  WolfCreek

FY14 Highlights
 •  Platform – SFA hardening, higher-speed embedded, WolfCreek, 7700E*, 7700x*
 •  IME – full speed on IME, PFS acceleration, more use cases under review
 •  WOS – S3/Swift, WOS Access, cost reduction, performance improvements

* GA = Future Release

SFA12KX | GA: Q2 2014

SFA 12K Family addition: SFA12KX High-Scale Active/Active Block Storage Appliance, Available May 2014

Specifications
 •  Appliances: Active/Active block controllers
 •  CPU: Dual-socket, 8-core Intel Ivy Bridge
 •  Memory & battery-backed cache: 128GB DDR3-1866, 64GB mirrored
 •  SFX flash cache: Up to 12.8TB of write-intensive SSDs
 •  RAID levels: RAID 1/5/6
 •  IB host connectivity: 16 x 56Gb/s (FDR)
 •  FC host connectivity: 32 x 16Gb/s
 •  Drive support: 2.5” and 3.5” SAS, SATA, SSD
 •  Max JBOD expansion: Up to 20
 •  Supported JBODs: SS7000, SS8460

[Chart: SFA12KX with 20x SS8460 enclosures – InfiniBand SRP bandwidth (GB/s); estimated performance of ~41GB/s write* and ~48GB/s read*. *Large block, full stripe; read I/O includes parity verification.]

Introducing 12KX (Q2 2014)

[Chart: Sequential Write Throughput – rate (MB/s, 0 to 45,000) vs. pool count (1 to 56) for 64K, 256K, 512K, 1M, 2M, 4M and 8M I/O sizes]

✓  Over 40 GB/s reads AND writes
✓  Over 20 GB/s MIRRORED writes
✓  Linear scalability
✓  SFX ready
✓  Latest Intel processor
✓  32 ports of FC-16

[Chart: Random Write Throughput – rate (MB/s, 0 to 25,000) vs. pool count (1 to 56) for 512K, 1M, 2M, 4M and 8M I/O sizes]

SFA Feature Baseline

High Density Array: Up to 840 HDDs in a single rack; 84 HDDs/SSDs in 4U.

Storage Fusion Fabric: Highly over-provisioned, RAIDed backend fabric; withstands failures of drives, enclosures, cabling, etc.

Reliability, Availability & Monitoring / Mgmt: RAID 10, (8+2) fast pool rebuilds, partial rebuilds, adjustable rebuild priority, SSD life counter.

Priority, Queuing, Real-Time: QoS for read operations, critical for striped-file performance consistency; ReACT; I/O fairness; lowest latency for small I/O; highest bandwidth for large and streaming I/O; dialed-in host I/O priority during rebuilds.

Performance, Flash: Real-time, multi-CPU RAID engine; interrupt-free, massively parallel I/O delivery system; RAID rebuild acceleration; SFX Cache, an automated, data-driven caching system enabling hybrid SSD and HDD systems.

Data Integrity & Security: DirectProtect™ RAID 1/5/6 real-time data integrity verification for every I/O; SED management; data-at-rest encryption of all user data; instant crypto erase.

SFA Partial Rebuild Example: Persistent Partial Rebuild

1. A complete enclosure is removed for a firmware upgrade.
2. The controllers send I/O destined for the removed enclosure to the available drives and hold the corresponding metadata in synchronous mirrored cache.
3. Controller 1 fails. I/O continues.
4. Complete system outage due to power failure. Controller 2 writes its cache to onboard drives.
5. Power is restored and the controller cache is restored. The upgraded enclosure undergoes a partial rebuild of the cached data in minutes.

84 x 4TB disks (336TB) removed for hours, rebuilt in minutes.

SFA Quality of Service: Read Retry Timeouts Coupled with DirectProtect DIF

RAID 6 (8+2) of NL-SAS disks, where one member has roughly 100% higher average latency than the others. Production is not impacted, thanks to DDN's QoS.

Lustre Metadata Benchmark and Performance: How to Scale Lustre Metadata Performance

Lustre Metadata Performance

▶  Lustre metadata performance is a crucial metric for many Lustre users
 •  LU-56 SMP scaling (Lustre 2.3)
 •  DNE (Lustre 2.4)

▶  Metadata performance is closely related to small-file performance on Lustre

▶  But metadata performance is still a little mysterious :)
 •  How does performance differ by metadata operation type and access pattern?
 •  What is the impact of hardware resources on metadata performance?

▶  This presentation uses standard metadata benchmark tools to analyze metadata performance on Lustre today

Lustre Metadata Benchmark Tools

▶  mds-survey
 •  Built into the Lustre code
 •  Similar to obdfilter-survey
 •  Generates load directly on the MDS to simulate Lustre metadata performance

▶  mdtest
 •  Major metadata benchmark tool used by many large HPC sites
 •  Runs on clients using MPI
 •  Supports several metadata operations and access patterns (example invocations for both tools are sketched below)
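A rough sketch of how these two tools are typically driven; the thread, process and file counts are illustrative only, and the mds-survey variable names should be verified against the lustre-iokit documentation for the installed release:

  # mds-survey: run directly on the MDS, exercising the MDT without any clients;
  # sweep 4 to 16 service threads, creating/stating/destroying 100,000 files per run
  thrlo=4 thrhi=16 file_count=100000 sh mds-survey

  # mdtest: run from the clients via MPI; -n is files/dirs per process,
  # -u gives each process its own unique working directory
  mpirun -np 64 mdtest -n 10000 -u -d /lustre/mdtest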


Single Client Metadata Performance Limitation

▶  Single-client metadata performance does not scale with the number of threads.

lustre/include/lustre_mdc.h:

/**
 * Serializes in-flight MDT-modifying RPC requests to preserve idempotency.
 *
 * This mutex is used to implement execute-once semantics on the MDT.
 * The MDT stores the last transaction ID and result for every client in
 * its last_rcvd file. If the client doesn't get a reply, it can safely
 * resend the request and the MDT will reconstruct the reply being aware
 * that the request has already been executed. Without this lock,
 * execution status of concurrent in-flight requests would be
 * overwritten.
 *
 * This design limits the extent to which we can keep a full pipeline of
 * in-flight requests from a single client. This limitation could be
 * overcome by allowing multiple slots per client in the last_rcvd file.
 */
struct mdc_rpc_lock {
	/** Lock protecting in-flight RPC concurrency. */
	struct mutex		rpcl_mutex;
	/** Intent associated with currently executing request. */
	struct lookup_intent	*rpcl_it;
	/** Used for MDS/RPC load testing purposes. */
	int			rpcl_fakes;
};

▶  LU-5319 adds support for multiple slots per client in the last_rcvd file (under development by Intel and Bull).

[Chart: Single Client Metadata Performance (Unique) – ops/sec (0 to 60,000) vs. number of threads (1 to 16) for file creation, file stat and file removal]

A client can send many metadata requests to the MDS simultaneously, but the MDS needs to record each client's last transaction ID, and that update is serialized.

Modified mdtest for Lustre

▶  Basic function
 •  Supports multiple mount points on a single client
 •  Helps generate a heavy metadata load from a single client (see the mount sketch below)

▶  Background
 •  Originally developed by Liang Zhen for the LU-56 work
 •  We rebased and cleaned up the code and made a few enhancements

▶  Enables metadata benchmarks on a small number of clients
 •  Regression testing
 •  MDS server sizing
 •  Performance optimization
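A minimal sketch of the multi-mount setup this relies on, assuming a hypothetical MGS NID of mgs@o2ib and a filesystem named lustre:

  # mount the same Lustre filesystem at 32 mount points on one client
  for i in $(seq 0 31); do
      mkdir -p /lustre_$i
      mount -t lustre mgs@o2ib:/lustre /lustre_$i
  done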


Performance Comparison

A single Lustre client mounts /lustre_0, /lustre_1, ..., /lustre_31 for a single filesystem.

[Chart: Single Client Metadata Performance (Unique, multiple mount points) – ops/sec (0 to 60,000) vs. number of threads (1 to 16) for file creation, file stat and file removal]

[Chart: Single Client Metadata Performance (Unique, single mount point) – ops/sec (0 to 60,000) vs. number of threads (1 to 16) for file creation, file stat and file removal]

# mdtest -n 10000 -u -d /lustre_{0-15}


Benchmark Configuration

 •  2 x MDS (2 x E5-2676v2, 128GB memory)
 •  4 x OSS (2 x E5-2680v2, 128GB memory)
 •  32 x clients (2 x E5-2680, 128GB memory)
 •  SFA12K-40: 400 x NL-SAS for OSTs, 8 x SSD for MDTs
 •  Lustre 2.5 for servers; Lustre 2.6.52 and Lustre 1.8.9 for clients

Metadata Benchmark Method

▶  Tested metadata operations
 •  Directory/file creation
 •  Directory/file stat
 •  Directory/file removal

▶  Access patterns (example invocations are sketched below)
 •  Unique directory and shared directory
  o  P0 -> /lustre/Dir0/file.0.0, P1 -> /lustre/Dir1/file.0.1 (Unique)
  o  P0 -> /lustre/Dir/file.0.0, P1 -> /lustre/Dir/file.1.0 (Shared)
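A rough sketch of how the two access patterns map onto mdtest options, using the multi-mount-point syntax shown earlier; the process and file counts mirror the 1024-process, 1.28M-file runs on the following slides and are illustrative only:

  # unique-directory pattern: -u gives every process its own working directory
  mpirun -np 1024 mdtest -n 1250 -u -d /lustre_{0-31}

  # shared-directory pattern: all processes operate in one shared directory (no -u)
  mpirun -np 1024 mdtest -n 1250 -d /lustre_{0-31}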


Lustre Metadata Performance: Impact of MDS CPU Speed

▶  Metadata performance comparison (unique directory)
 •  32 clients (1024 mount points), 1024 processes, 1.28M files
 •  Tested on 16 CPU cores at 2.1, 2.5, 2.8, 3.3 and 3.6GHz MDS CPU speed

[Chart: Directory Operations (Unique) – relative performance (0 to 180%) at 2.1 to 3.6GHz for directory creation, directory stat and directory removal]

[Chart: File Operations (Unique) – relative performance (0 to 180%) at 2.1 to 3.6GHz for file creation, file stat and file removal]

Chart callouts: 38%, 20%, 60%, 38%, 70%

Lustre Metadata Performance: Impact of MDS CPU Speed

▶  Metadata performance comparison (shared directory)
 •  32 clients (1024 mount points), 1024 processes, 1.28M files
 •  Tested on 16 CPU cores at 2.1, 2.5, 2.8, 3.3 and 3.6GHz MDS CPU speed

[Chart: Directory Operations (Shared) – relative performance (0 to 180%) at 2.1 to 3.6GHz for directory creation, directory stat and directory removal]

[Chart: File Operations (Shared) – relative performance (0 to 180%) at 2.1 to 3.6GHz for file creation, file stat and file removal]

Chart callouts: 30%, 58%, 22%, 50%

Lustre Metadata Performance: Impact of MDS CPU Cores

▶  Metadata performance comparison (unique directory)
 •  32 clients (1024 mount points), 1024 processes, 1.28M files
 •  Tested at 3.3GHz CPU speed with 8, 12 and 16 CPU cores, with and without logical processors

[Chart: Directory Operations (Unique) – relative performance (0 to 250%) for directory creation, directory stat and directory removal]

[Chart: File Operations (Unique) – relative performance (0 to 250%) for file creation, file stat and file removal]

Chart callouts: 100%, 25%, 80-120%

Lustre Metadata Performance: Impact of MDS CPU Cores

▶  Metadata performance comparison (shared directory)
 •  32 clients (1024 mount points), 1024 processes, 1.28M files
 •  Tested at 3.3GHz CPU speed with 8, 12 and 16 CPU cores, with and without logical processors

[Chart: Directory Operations (Shared) – relative performance (0 to 160%) for directory creation, directory stat and directory removal]

[Chart: File Operations (Shared) – relative performance (0 to 180%) for file creation, file stat and file removal]

Chart callouts: creation (shared) and stat do not scale; 60%; no scaling from 12 to 16 CPU cores

Lustre Metadata Performance: MDS Scalability (Unique Directory)

[Chart: Lustre Metadata Scalability (Unique), 32 clients, 1024 mount points – ops/sec (0 to 250,000) vs. number of threads (16 to 1024) for file creation, file removal, directory creation, directory removal, file creation (DNE) and file removal (DNE)]

Chart callouts: 100% by DNE; directory creation at 50% of file creation; multiple mount points

Lustre Metadata Performance: MDS Scalability (Shared Directory)

[Chart: Lustre Metadata Scalability (Shared), 32 clients, 1024 mount points – ops/sec (0 to 100,000) vs. number of threads (16 to 1024) for file creation, file removal, directory creation and directory removal]

Chart callouts: 80% of the performance of the unique-directory pattern; directory creation; multiple mount points

Lustre Metadata Performance: File Creation and Removal for Small Files

[Chart: Small File Performance (Unique Directory), 32 clients, 1024 mount points – ops/sec (0 to 180,000) vs. file size (0 to 131,072 bytes) for file creation, file read and file removal]

[Chart: Small File Performance (Shared Directory), 32 clients, 1024 mount points – ops/sec (0 to 180,000) vs. file size (0 to 131,072 bytes) for file creation, file read and file removal]

Files are created with actual data (4K, 8K, 16K, 32K, 64K and 128K; stripe count = 1), as sketched below.

Small-file performance is bounded by metadata performance, but there is no performance impact from file size in this range.
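A minimal sketch of one of these runs, assuming mdtest's -w/-e options for writing data into and reading it back from each file; the 4KiB size shown is just one of the tested points, and the stripe setup line is illustrative:

  # set a default stripe count of 1 for the test filesystem root (run once beforehand)
  lfs setstripe -c 1 /lustre_0

  # create, read back and remove files that each contain 4KiB of data
  mpirun -np 1024 mdtest -n 1250 -u -w 4096 -e 4096 -d /lustre_{0-31}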


Summary Observations

▶  MDS server resources significantly affect Lustre metadata performance
 •  Performance scales well with the number of CPU cores and with CPU speed for the unique-directory access pattern, but it is not CPU bound for the shared-directory access pattern
 •  Baseline results were collected with 16 CPU cores; more tests with higher core counts are needed

▶  Performance is highly dependent on the metadata access pattern
 •  Example: directory creation vs. file creation
 •  With actual file sizes (instead of zero-byte files), there is little impact with a small number of OSTs (e.g. up to 40 OSTs), but testing on a larger number of OSTs is needed

Metadata Performance: Future Work

▶  Known issues and optimizations
 •  Client-side metadata optimization, especially single-client metadata performance
 •  Various performance regressions in Lustre 2.5/2.6 that need to be addressed (e.g. LU-5608)

▶  Areas of future investigation
 •  Real-world metadata use scenarios and metadata problems
 •  Real-world small-file performance (e.g. life sciences)
 •  Impact of OST data structures on real-world metadata performance
 •  DNE scalability on very large systems with many MDSs/MDTs and many OSSs/OSTs