
Page 1:

ADIC / CASPUR / CERN / DataDirect / ENEA / IBM / RZ Garching / SGI

New results from CASPUR Storage Lab

Andrei Maslennikov
CASPUR Consortium

May 2004

Page 2:

Participants:

ADIC Software : E.Eastman

CASPUR : A.Maslennikov(*), M.Mililotti, G.Palumbo

CERN : C.Curran, J.Garcia Reyero, M.Gug, A.Horvath, J.Iven, P.Kelemen, G.Lee, I.Makhlyueva, B.Panzer-Steindel, R.Többicke, L.Vidak

DataDirect Networks : L.Thiers

ENEA : G.Bracco, S.Pecoraro

IBM : F.Conti, S.De Santis, S.Fini

RZ Garching : H.Reuter

SGI : L.Bagnaschi, P.Barbieri, A.Mattioli

(*) Project Coordinator

Page 3:

Sponsors for these test sessions:

ACAL Storage Networking : Loaned a 16-port Brocade switch

ADIC Software : Provided the StorNext file system product, actively participated in tests

DataDirect Networks : Loaned an S2A 8000 disk system, actively participated in tests

E4 Computer Engineering : Loaned 10 assembled biprocessor nodes

Emulex Corporation : Loaned 16 fibre channel HBAs

IBM : Loaned a FAStT900 disk system and the SANFS product complete with 2 MDS units,

actively participated in tests

Infortrend-Europe : Sold 4 EonStor disk systems at discount price

INTEL : Donated 10 motherboards and 20 CPUs

SGI : Loaned the CXFS product

Storcase : Loaned an InfoStation disk system

Page 4:

Contents

• Goals
• Components under test
• Measurements:
  - SATA/FC systems
  - SAN File Systems
  - AFS Speedup
  - Lustre (preliminary)
  - LTO2
• Final remarks

Page 5:

Goals for these test series

1. Performance of low-cost SATA/FC disk systems

2. Performance of SAN File Systems

3. AFS Speedup options

4. Lustre

5. Performance of the LTO-2 tape drive

Page 6:

Components

Disk systems:

- 4x Infortrend EonStor A16F-G1A2 16-bay SATA-to-FC arrays: Maxtor Maxline Plus II 250 GB SATA disks (7200 rpm), dual Fibre Channel outlet at 2 Gbit, cache: 1 GB

- 2x IBM FAStT900 dual-controller arrays with SATA expansion units: 4x EXP100 expansion units with 14 Maxtor SATA disks of the same type, dual Fibre Channel outlet at 2 Gbit, cache: 1 GB

- 1x StorCase InfoStation 12-bay array: same Maxtor SATA disks, dual Fibre Channel outlet at 2 Gbit, cache: 256 MB

- 1x DataDirect S2A 8000 system: 2 controllers with 74 FC disks of 146 GB, 8 Fibre Channel outlets at 2 Gbit, cache: 2.56 GB

Page 7:

Infortrend EonStor A16F-G1A2

- Two 2Gbps Fibre host channels
- RAID levels supported: RAID 0, 1 (0+1), 3, 5, 10, 30, 50, NRAID and JBOD
- Multiple arrays configurable with dedicated or global hot spares
- Automatic background rebuild
- Configurable stripe size and write policy per array
- Up to 1024 LUNs supported
- 3.5", 1" high 1.5Gbps SATA disk drives
- Variable stripe size per logical drive
- Up to 64TB per LD
- Up to 1GB SDRAM

Page 8:

FAStT900 Storage Server

- 2 Gbps SFP host ports

- Expansion units: EXP700 (FC) / EXP100 (SATA)

- Four SAN (FC-SW) or eight direct (FC-AL) host attachments

- Four (redundant) 2 Gbps drive channels

- Capacity: min 250GB – max 56TB (14 disks x EXP100 SATA); min 32GB – max 32TB (14 disks x EXP700 FC)

- Dual-active controllers

- Cache: 2GB

- RAID levels supported: 0, 1, 3, 5, 10

[Figure: FAStT900 storage server and EXP100 expansion unit]

Page 9:

StorCase Fibre-to-SATA

- SATA and Ultra ATA/133 Drive Interface

- 12 hot swappable drives

- Switched or FC-AL host connections

- RAID levels: 0, 1, 0+1, 3, 5, 30, 50 and JBOD

- Dual Fibre 2Gbps host ports

- Supports up to 8 arrays and 128 LUNs

- Up to 1GB PC200 DDR cache memory

Page 10:

DataDirect S²A8000

- Single 2U S2A8000 with four 2Gb/s ports, or dual 4U with eight 2Gb/s ports
- Up to 1120 disk drives; 8192 LUNs supported
- 5TB to 130TB with FC disks, 20TB to 250TB with SATA disks
- Sustained performance well over 1GB/s (1.6 GB/s theoretical)
- Full Fibre Channel duplex performance on every port
- PowerLUN™: 1 GB/s+ individual LUNs without host-based striping
- Up to 20GB of cache, LUN-in-Cache Solid State Disk functionality
- Real-time Any-to-Any virtualization
- Very fast rebuild rate

Page 11:

Components

- High-end Linux units for both servers and clients: biprocessor Pentium IV Xeon 2.4+ GHz, 1GB RAM, Qlogic QLA2300 2Gbit or Emulex LP9xxx Fibre Channel HBAs

- Network: 2x Dell 5224 GigE switches

- SAN: Brocade 3800 switch, 16 ports (test series 1); Qlogic SANbox 5200, 32 ports (test series 2)

- Tapes: 2x IBM Ultrium LTO2 (3580-TD2, Rev: 36U3)

Page 12:

Qlogic SANbox 5200 Stackable Switch

- 8, 12 or 16 auto-detecting 2Gb/1Gb device ports with 4-port incremental upgrades
- Stacking of up to 4 units for 64 available user ports
- Interoperable with all FC-SW-2 compliant Fibre Channel switches
- Full-fabric, public-loop or switch-to-switch connectivity on 2Gb or 1Gb front ports
- "No-wait" routing: guaranteed maximum performance independent of data traffic
- Supports traffic between switches, servers and storage at up to 10Gb/s
- Low cost: the 5200/16p costs at most half as much as a Brocade 3800/16p
- May be upgraded in 8-port steps

Page 13:

IBM LTO Ultrium 2 Tape Drive Features

- 200 GB Native Capacity (400 GB compressed)

- 35 MB/s native (70 MB/s compressed)

- Read/Write LTO 1 Cartridge

- Native 2Gb FC Interface

- Backward read/write with Ultrium 1 cartridge

- 64 MB buffer (vs 32 MB buffer in Ultrium 1)

- Speed Matching, Channel Calibration

- 512 Tracks vs. 384 Tracks in Ultrium 1

- 64 MB Buffer vs. 32 MB in Ultrium 1

- Enhanced capacity (200GB)
- Enhanced performance (35 MB/s)
- Backward compatible
- Faster load/unload time, data access time, rewind time

Page 14:

SATA / FC Systems

Page 15:

SATA / FC Systems – hw details

Typical array features:
- single or dual (active-active) controller
- up to 1GB of RAID cache
- battery to keep the cache alive during power cuts
- 8 to 16 drive slots
- cost: 4-6 KUSD per 12/16-bay unit (Infortrend, Storcase)

Case and backplane directly impact the disks' lifetime:
- protection against inrush currents
- protection against rotational vibration
- orientation (horizontal better than vertical – remark by A.Sansum)

Infortrend EonStor: well engineered (removable controller module, lower vibration, horizontal orientation)
Storcase: special protection against inrush currents ("soft-start" drive power circuitry), low vibration

Page 16:

SATA / FC Systems – hw details

High-capacity ATA/SATA disk drives:
- 250GB (Maxtor, IBM), 400GB (Hitachi)
- RPM: 7200
- improved quality: 3-year warranty, 5-year component design lifetime

CASPUR experience with Maxtor drives:
- In 1.5 years we lost 5 drives out of ~100, 2 of them due to power cuts
- Factory quality of the recent Maxtor Maxline Plus II 250 GB disks: out of 66 disks purchased, 4 had to be replaced shortly after arrival; the others stand the stress very well

Learned during this meeting:
- RAL's annual failure rate is 21 out of 920 Maxtor Maxline drives

Page 17:

SATA / FC Systems – test setup

Parameters to select / tune:
- stripe size for RAID-5
- SCSI queue depth on the controller and on the Qlogic HBAs
- number of disks per logical drive

In the end, we worked with RAID-5 LUNs composed of 8 HDs each. Stripe size: 128K (and 256K in some tests).

[Test setup diagram: 16 dual 2.4+ GHz nodes with Qlogic 2310F HBAs and a Dell 5224 GigE switch, connected through 2x Qlogic SANbox 5200 FC switches to 4x IFT A16F-G1A2, 4x IBM FAStT900 and the StorCase InfoStation]

Page 18:

SATA / FC tests – kernel and fs details

Kernel settings:
- Kernels: 2.4.20-30.9smp, 2.4.20-20.9.XFS1.3.1smp
- vm.bdflush: "2 500 0 0 500 1000 20 10 0"
- vm.max(min)-readahead: 256(127) for large streaming writes, 4(3) for random reads with small block sizes

File systems:
- EXT3 (128k RAID-5 stripe size):
  fs options: "-m 0 -j -J size=128 -R stride=32 -T largefile4"
  mount options: "data=writeback"
- XFS 1.3.1 (128k RAID-5 stripe size):
  fs options: "-i size=512 -d agsize=4g,su=128k,sw=7,unwritten=0 -l su=128k"
  mount options: "logbsize=262144,logbufs=8"
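Put together, the tuning above corresponds to commands along these lines (a minimal sketch: the device names and mount points are illustrative, and the flags are copied from the list above, so they assume the same 2.4-era tool versions; note that su=128k,sw=7 simply mirrors the 7 data disks of an 8-disk RAID-5 LUN with a 128k stripe):

  # 2.4-kernel VM tuning used for the streaming-write runs
  sysctl -w vm.bdflush="2 500 0 0 500 1000 20 10 0"
  sysctl -w vm.max-readahead=256     # 4 for the random-read runs
  sysctl -w vm.min-readahead=127     # 3 for the random-read runs

  # EXT3 on a RAID-5 LUN with a 128k stripe (device /dev/sda1 is an example)
  mke2fs -m 0 -j -J size=128 -R stride=32 -T largefile4 /dev/sda1
  mount -t ext3 -o data=writeback /dev/sda1 /fs

  # XFS 1.3.1 with matching stripe geometry
  mkfs.xfs -i size=512 -d agsize=4g,su=128k,sw=7,unwritten=0 -l su=128k /dev/sdb1
  mount -t xfs -o logbsize=262144,logbufs=8 /dev/sdb1 /fs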

Page 19:

SATA / FC tests – benchmarks used

Large serial writes and reads:
- "lmdd" from the "lmbench" suite: http://sourceforge.net/projects/lmbench
  typical invocation: lmdd of=/fs/file bs=1000k count=8000 fsync=1

Random reads:
- Pileup benchmark ([email protected]), designed to emulate the disk activity of multiple data analysis jobs:
  1) a series of 2GB files is created in the destination directory
  2) these files are then read in a random pattern by many threads
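The access patterns can be approximated with standard tools; the following is a rough sketch only (the file names, sizes, thread count and block size are illustrative, and this is not the actual Pileup code):

  # Serial write, as with the lmdd invocation above (~8 GB, fsync before exit)
  lmdd of=/fs/bigfile bs=1000k count=8000 fsync=1

  # Crude pileup-style random reads: 16 concurrent readers picking random 1MB
  # blocks out of previously created 2GB files /fs/pile.0 ... /fs/pile.9
  for t in $(seq 1 16); do
    (
      for i in $(seq 1 200); do
        f=/fs/pile.$((RANDOM % 10))
        dd if=$f of=/dev/null bs=1M count=1 skip=$((RANDOM % 2048)) 2>/dev/null
      done
    ) &
  done
  wait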

Page 20:

SATA / FC results

EXT3 results – filling 1.7 TB with 8GB files

The IFT systems show anomalous behaviour with the EXT3 file system: performance varies along the file system. The effect visibly depends on the RAID-5 stripe size:

[Plots: EXT3 write speed along the file system for RAID-5 stripe sizes of 32K, 128K and 256K]

! The problem was reproduced and understood by Infortrend

New firmware is due in July

Page 21:

IBM FAStT and Storcase behave in a more predictable manner with EXT3. Both these systems may, however, lose up to 20% in performance along the file system:

SATA / FC results

Page 22:

SATA / FC results

XFS results – filling 1.7 TB with 8GB files

The situation changes radically with this file system: the curves become almost flat, and everything is much faster than with EXT3:

[Plots: XFS speed along the file system for the IBM, StorCase and Infortrend arrays]

Infortrend and Storcase show comparable write speeds of about 135-140 MB/sec; IBM is much slower on writes (below 100 MB/sec).

Read speeds are visibly higher thanks to the read-ahead function of the controller (the IBM and IFT systems had 1 GB of RAID cache, Storcase only 256 MB).

Page 23:

SATA / FC results

Pileup tests: these tests were done only on the IFT and Storcase systems. The results depend to a large extent on the number of threads that access the previously prepared files (beyond a certain number of threads, performance may drop, since the test machine may have problems handling many threads at a time).

The best result was obtained with the Infortrend array and the XFS file system:

  Number of    EXT3, MB/sec             XFS, MB/sec
  threads      Storcase   Infortrend    Storcase   Infortrend
  4            3.7        3.8           9.5        12.1
  8            4.4        4.4           10.3       16.8
  16           4.4        4.7           12.0       19.3
  32           4.5        4.8           12.6       17.9
  64           4.4        4.7           11.0       15.9

Page 24:

SATA / FC results

Operation in degraded mode:

We tried it on a single Infortrend LUN of 5 HDs with EXT3. One of the disks was removed, and the rebuild process was started.

The write speed went down from 105 to 91 MB/sec. The read speed went down from 105 to 28 MB/sec, and at times even less.

Page 25:

1) The recent low-cost SATA-to-FC disk arrays (Infortrend, Storcase) operate very well and deliver excellent I/O speeds, far exceeding Gigabit Ethernet. The cost of such systems may be as low as 2.5 USD per raw GB. The quality of these systems is dominated by the quality of the SATA disks.

2) The choice of local file system is fundamental. XFS easily outperforms EXT3.

On one occasion we observed an XFS hang under a very heavy load. "xfs_repair" was run, and the error never reappeared. We are now planning to investigate this in depth. CASPUR AFS and NFS servers are all XFS-based, and there has been only one XFS-related problem since we put XFS into production 1.5 years ago. But perhaps we were simply lucky.

SATA / FC results - conclusions

Page 26:

SAN File Systems

Page 27:

SAN FS placement: these advanced distributed file systems allow clients to operate directly on block devices (block-level file access). Metadata traffic goes over GigE. A Storage Area Network is required.

The current cost of a single fibre channel connection is above 1000 USD: a switch port is min ~500 USD including the GBIC, a host bus adapter min ~800 USD.

Special discounts for large purchases are not impossible, but it is very hard to imagine the cost of a connection dropping below 600-700 USD in the near future.

A SAN FS with native fibre channel connections is still not an option for large farms. A SAN FS with iSCSI connections may be re-evaluated in combination with the new iSCSI-SATA disk arrays.

SAN File Systems

Page 28:

Where SAN file systems with FC connection may be used:

1) High Performance Computing – fast parallel I/O, faster sequential I/O

2) Hybrid SAN / NAS systems: a relatively small number of SAN clients acting as (also redundant) NAS servers

3) HA clusters with file locking: mail (shared pool), web, etc.

SAN File Systems

Page 29:

So far, we have tried these products:

0) Sistina GFS (see our 2002 and 2003 reports)
1) ADIC StorNext File System
2) IBM SANFS (StorTank) (preliminary, we continue looking into it)
3) SGI CXFS (work in progress)

SAN File Systems

Page 30:

SAN File Systems

  FS         Platforms                                           MDS host     Max FS size
                                                                 required
  GFS        Server-Client: Linux32/64                           No           2 TB
  StorNext   Server-Client: AIX, Linux, Solaris, IRIX, Windows   No           petabytes
  StorTank   Server: Linux32; Client: AIX, Linux, Windows,       Yes          petabytes
             Solaris
  CXFS       Server: IRIX/Linux64; Client: IRIX, Solaris, AIX,   Yes          exabytes
             Windows, Linux, OS X                                             (Linux32: 2 TB)

Page 31:

SAN File Systems

What was measured (StorNext and StorTank):
1) Aggregate write and read speeds on 1, 7 and 14 clients
2) Aggregate Pileup speed on 1, 7 and 14 clients accessing:
   A) different sets of files
   B) the same set of files

During these tests we used 4 LUNs of 13 HDs each, as recommended by IBM. For each SAN FS we tried both the IFT and the FAStT disk systems.

[Test setup diagram: 16 dual 2.4+ GHz nodes with Qlogic 2310F HBAs, 2x Qlogic SANbox 5200 FC switches, a Dell 5224 GigE switch, 4x IFT A16F-G1A2 and 4x IBM FAStT900 arrays, an IA32 IBM StorTank MDS and an Origin 200 CXFS MDS]

Page 32:

SAN File Systems

Large sequential files: StorNext and StorTank behave in a similar manner on writes; StorNext does better on reads. The IBM disk systems perform better than IFT on reads with multiple clients. All numbers are in MB/sec.

IBM StorTank:

             1 Client        7 Clients       14 Clients
             IBM    IFT      IBM    IFT      IBM    IFT
  Write      115    107      275    -        300    341
  Read       125    135      357    252      423    322

ADIC StorNext:

             1 Client        7 Clients       14 Clients
             IBM    IFT      IBM    IFT      IBM    IFT
  Write      131    157      246    300      331    340
  Read       186    174      532    270      630    285

Page 33:

SAN File Systems

Pileup tests: StorTank definitively outperforms StorNext in this type of benchmark. The results are very interesting: it turns out that peak Pileup speeds with StorTank on a single client may reach GigE speed (case of the IFT disks). All numbers are in MB/sec.

IBM StorTank:

  Threads      1 Client        7 Clients       14 Clients
               IBM    IFT      IBM    IFT      IBM    IFT
  32 A         55     91       111    88       124    102
  64 A         72     120      159    116      138    72
  64 B         100    23

ADIC StorNext:

  Threads      1 Client        7 Clients       14 Clients
               IBM    IFT      IBM    IFT      IBM    IFT
  32 A         19     23       47     44       43     42
  64 A         21     23       45     44       46     42
  64 B         31     10

! Unstable for IFT with more than 1 client

Page 34:

CXFS experience: MDS on an SGI Origin 200 with 1 GB of RAM (IRIX 6.5.22), 4 IFT arrays. The first numbers were not bad, but with 4 or more clients the system becomes unstable (when they are all used at the same time, one client will hang). This is what we have observed so far:

SAN File Systems

  N of Clients   Seq. Write   Seq. Read
  1              62 MB/s      130 MB/s
  2              91 MB/s      245 MB/s
  3              117 MB/s     306 MB/s

We are currently investigating the problem together with SGI.

Page 35:

SAN File Systems

StorNext on the DataDirect system

- The S2A 8000 came with FC disks, although we asked for SATA
- Quite easy to configure, extremely flexible
- Multiple levels of redundancy, small declared performance degradation on rebuilds
- We ran only large serial write and read 8GB lmdd tests, using all the available power:

       EXT2, 8 distinct LUNs   StorNext, 2 PowerLUNs
       R/W, MB/sec             R/W, MB/sec
  1    140 / 144               178 / 180
  8    470 / 700               380 / 535
  16   -                       570 / 1000

[Test setup diagram: 16 dual 2.4+ GHz nodes with Emulex LP9xxx HBAs and a Dell 5224 GigE switch, connected through 2x Brocade 3800 FC switches to the 2x S2A8000 controllers with 8 FC outlets]

Page 36:

SAN File Systems – some remarks

- The performance of a SAN file system is quite close to that of the disk hardware it is built upon (in the case of a native FC connection).

- StorNext is the easiest to configure. It does not require a standalone MDS and works smoothly with all kinds of disk systems, FC switches etc. We were able to export it via NFS, but with the loss of 50% of the available bandwidth. iSCSI: ?

- StorTank is probably the most solid implementation of a SAN FS, and it has a lot of useful options. It delivers the best numbers for random reads and may be considered a good candidate for relatively small clusters with a native FC connection destined for express data analysis. It may have issues with 3rd-party disks. Supports iSCSI.

- CXFS uses the very performant XFS base and hence should have good potential, although the 2 TB file system size limit on Linux/32bit is a real limitation (the same is true for GFS). Some functions, like MDS fencing, require particular hardware. iSCSI: ?

- MDS load: small for StorNext and CXFS, quite high for StorTank.
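For reference, the NFS re-export mentioned above is plain kernel NFS running on a SAN client that acts as a NAS server (a sketch only; the mount point /stornext, the subnet and the host name nas1 are illustrative, not the lab configuration):

  # /etc/exports on the SAN client acting as a NAS server
  /stornext  192.168.1.0/255.255.255.0(rw,sync,no_root_squash)

  # publish the export, then mount it from a farm node
  exportfs -ra
  mount -t nfs nas1:/stornext /stornext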

Page 37:

AFS Speedup

Page 38:

- AFS performance for large files is quite poor (max 35-40 MB/sec even on very performant hardware). To a large extent this is due to the limitations of the Rx RPC protocol, and to a less-than-optimal implementation of the file server.

- One possible workaround is to replace the Rx protocol with an alternative one in all cases where it is used for file serving. We evaluated two such experimental implementations:

1) AFS with OSD support (Rainer Toebbicke). Rainer stores AFS data inside Object-based Storage Devices (OSDs), which do not necessarily reside inside the AFS file servers. The OSD performs basic space management and access control, and is implemented as a Linux daemon in user space on top of an EXT2 file system. The AFS file server acts only as an MDS.

2) Reuter's Fast AFS (Hartmut Reuter). In this approach, AFS partitions (/vicepXX) are made visible on the clients via a fast SAN or NAS mechanism. As in case 1), the AFS file server acts as an MDS and directs the clients to the right files inside /vicepXX for faster data access.

AFS speedup options

Page 39:

Both methods worked!

The AFS/OSD scheme was tested during the Fall 2003 test session; the tests were done with DataDirect's S2A 8000 system. In one particular test we achieved a 425 MB/sec write speed for both the native EXT2 and the AFS/OSD configurations. Reuter's AFS was evaluated during the Spring 2004 session. The StorNext SAN file system was used to distribute a /vicepX partition among several clients. As in the previous case, AFS/Reuter performance was practically equal to the native StorNext performance for large files.

To learn more on the DataDirect system and the Fall 2003 session, please visit the following site: http://afs.caspur.it/slab2003b.

AFS speedup options

Page 40:

Lustre!

Page 41:

Lustre – preliminary results

- Lustre 1.0.4
- We used 4 Object Storage Targets on 4 Infortrend arrays, no striping
- Very interesting numbers for sequential I/O (8GB files, MB/sec):

  N of Clients   Seq. Write   Seq. Read
  1              72           33
  6              319          234
  14             310          287

- These numbers may be directly compared with the SAN FS results obtained on the same disk arrays:

  N of Clients    Seq. Write   Seq. Read
  StorTank, 1     107          135
  StorNext, 1     157          174
  StorTank, 14    341          322
  StorNext, 14    340          285

Page 42:

LTO-2 Tape Drive

Page 43:

The drive is a "factor 2" evolution of its predecessor, LTO-1. According to the specs, it should be able to deliver up to 35 MB/sec native I/O speed and 200 GB of native capacity.

We were mainly interested in checking the following (see next page):

- write speed as a function of block size
- time to write a tape mark
- positioning times

The overall judgement: quite positive. The drive fits well for backup applications and is acceptable for staging systems. Its strong point is definitely its relatively low cost (10-11 KUSD), which makes it quite competitive (compare with ~30 KUSD for an STK 9940B).

LTO-2 tape drive

Page 44:

LTO-2

- Write speed as a function of blocksize: > 31 MB/sec native for large blocks, very stable

- Tape mark writing is rather slow: 1.4-1.5 sec per tape mark

- Positioning: it may take up to 1.5 minutes to fsf to the needed file (average: 1 minute)
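These quantities can be measured with standard tools; the sketch below shows one way to do it (the device name /dev/nst0, the file sizes and the loop counts are illustrative, and this is not necessarily how the lab measurements were scripted). Incompressible input is used so that the drive's compression does not inflate the apparent native speed:

  # prepare ~8 GB of incompressible data once
  dd if=/dev/urandom of=/data/rand8g bs=1M count=8000

  # 1) write speed as a function of block size
  for bs in 64k 256k 1M; do
    mt -f /dev/nst0 rewind
    echo "block size $bs:"
    time dd if=/data/rand8g of=/dev/nst0 bs=$bs
  done

  # 2) time to write a single tape mark
  time mt -f /dev/nst0 weof 1

  # 3) positioning: time to space forward over 10 file marks
  mt -f /dev/nst0 rewind
  time mt -f /dev/nst0 fsf 10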

Page 45:

Final remarks

Our immediate plans include:

- Further investigation of StorTank, CXFS and yet another SAN file system (Veritas), including NFS export

- Evaluation of iSCSI-enabled SATA RAID arrays in combination with SAN file systems

- Further Lustre testing on IFT and IBM hardware (new version 1.2, striping, other benchmarks)

Feel free to join us at any moment !