
Dell EMC Ready Solution for HPC Lustre Storage Using PowerVault ME4 February 2019

H17632

White Paper

Abstract This white paper describes a high-throughput, scalable Lustre file system solution. The solution, which includes 14th generation PowerEdge servers and PowerVault ME storage, supports Mellanox InfiniBand EDR and Intel Omni-Path networks.

Dell EMC Solutions


Copyright


The information in this publication is provided as is. Dell Inc. makes no representations or warranties of any kind with respect to the information in this publication, and specifically disclaims implied warranties of merchantability or fitness for a particular purpose. Use, copying, and distribution of any software described in this publication requires an applicable software license. Copyright © 2019 Dell Inc. or its subsidiaries. All Rights Reserved. Dell Technologies, Dell, EMC, Dell EMC and other trademarks are trademarks of Dell Inc. or its subsidiaries. Intel, the Intel logo, the Intel Inside logo and Xeon are trademarks of Intel Corporation in the U.S. and/or other countries. Other trademarks may be trademarks of their respective owners. Published in the USA 02/19 White Paper H17632. Dell Inc. believes the information in this document is accurate as of its publication date. The information is subject to change without notice.


Contents

Executive summary

Lustre file system

Ready Solution overview

Performance evaluation

Performance tuning parameters

Conclusion

References

Appendix A: Benchmark command reference


Executive summary


Business case

In high-performance computing (HPC), the efficient delivery of data to and from the compute nodes is critical and often complicated to execute. Researchers can generate and consume data in HPC systems at such speed that the storage components become a major bottleneck. Getting maximum performance for their applications therefore requires a scalable storage solution. Open-source Lustre systems can deliver the required performance and storage capacity. However, managing and monitoring such a complex storage system adds to the burden on storage administrators and researchers. Performance and capacity requirements grow rapidly, and increasing the throughput and scalability of the storage systems supporting the HPC system can require a great deal of planning and configuration.

Solution overview

The Dell EMC Ready Solution for HPC Lustre Storage is designed for academic and industry users who need to deploy a fully supported, easy-to-use, high-throughput, scale-out, and cost-effective parallel file system storage solution. The Ready Solution uses the community edition of Lustre 2.10.4, maintained by Whamcloud, and is a scale-out storage appliance that provides high performance and high availability. Its intelligent and intuitive management interface, the Integrated Manager for Lustre (IML), greatly simplifies deploying, managing, and monitoring the hardware and file system components. The solution is easy to scale in capacity, performance, or both, providing a convenient path for future growth.

The Ready Solution for HPC Lustre Storage delivers a superior combination of performance, reliability, density, ease of use, and cost-effectiveness. The solution uses the 14th generation of enterprise Dell EMC PowerEdge servers and the high-density Dell EMC PowerVault ME storage line, which offers improved capacity, density, performance, simplicity, and features compared to other Dell EMC entry-level storage systems. The solution comes with full hardware and software support from Dell EMC and Whamcloud.

We value your feedback

Dell EMC and the authors of this document welcome your feedback on the solution and the solution documentation. Contact the Dell EMC Solutions team by email or provide your comments by completing our documentation survey.

Author: Jyothi Bhaskar



Lustre file system


Lustre is a parallel file system, offering high performance through parallel access to data and distributed locking. A Lustre installation consists of three key elements: the metadata subsystem, the object storage subsystem (data), and the compute clients that access and operate on the data.

The following figure shows the storage solution components that are based on the Lustre file system.

Figure 1. Lustre-based storage solution components1

As shown in the figure, the metadata subsystem includes:

• Metadata targets (MDTs)—Store all file system metadata including file names, permissions, time stamps, and the location of data objects within the object storage system

• Management target (MGT)—Stores management data such as configuration information and registry

• Metadata server (MDS)—Manages the file system namespace (files and directories) and file layout allocation and determines where the files on the object storage are stored

The Lustre file system must have at least one MDT, but more than one MDT can be attached to an MDS. The MDS is scalable due to the Distributed Namespace (DNE) feature, which enables metadata to be stored across multiple MDTs. Lustre DNE phase 2 enables directory striping. Typically, two servers are paired into a high-availability MDS configuration.

The object storage subsystem includes one or more object storage targets (OSTs) and one or more object storage servers (OSSs). The OSTs provide storage for file object data, while each OSS manages one or more OSTs. Typically, several OSSs are active at any time. Lustre delivers increased throughput by increasing the number of active OSSs (and associated OSTs), ideally in pairs. Each additional OSS increases the existing networking throughput and processing power, while each additional OST increases the storage capacity. Figure 1 shows the relationship of the MDS, MDT, OSS, and OST components of a typical Lustre configuration. Clients in the figure are the HPC cluster's compute nodes.

1 http://wiki.Lustre.org/images/6/64/LustreArchitecture-v4.pdf

A parallel file system, such as Lustre, delivers performance and scalability by distributing (“striping”) data across multiple OSTs, enabling multiple compute nodes to simultaneously access the data. A key design consideration of Lustre is the separation of metadata access from I/O data access, which improves the overall system performance.

Lustre client software is installed on the compute nodes to enable access to data that is stored on the Lustre file system. To the clients, the file system appears as a single namespace that can be mounted for access. This single mount point provides a simple starting point for application data access and enables access through native client operating system tools for easier administration.
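As an illustration of this single mount point, the following is a minimal sketch of mounting a Lustre file system from a compute node. The MGS NID (10.10.0.10@o2ib), file system name (lustre), and mount point are hypothetical placeholders, not values taken from this solution:

mount -t lustre 10.10.0.10@o2ib:/lustre /mnt/lustre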

Lustre includes a sophisticated and enhanced storage network protocol, Lustre Network, referred to as LNet, which enables Lustre to use certain types of network features. For example, when the Ready Solution is deployed with InfiniBand as the network for connecting the clients, MDS, and OSSs, LNet enables Lustre to take advantage of the RDMA capabilities of the InfiniBand fabric to provide faster I/O transport and lower latency compared to typical networking protocols.
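For example, on a node with an InfiniBand interface, LNet is typically directed to use the RDMA-capable o2ib network through a module option such as the following. This is a sketch only; the interface name ib0 is an assumption, and this is not necessarily how the Ready Solution configures LNet:

cat /etc/modprobe.d/lustre.conf
options lnet networks=o2ib0(ib0)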

To summarize, the elements of the Lustre file system are as follows:

• MDT—Stores the location of stripes of data, file names, time stamps, and so on

• MGT—Stores management data such as configuration and registry

• MDS—Manages the MDT, providing Lustre clients with access to files

• OST—Stores the data stripes or extents of the files on a file system

• OSS—Manages the OSTs, providing Lustre clients with access to the data

• Clients—Access the MDS to determine where files are located and then access the OSSs to read and write data

Typically, Lustre configuration and deployment are highly complex and time-consuming tasks. Lustre installation and administration are normally performed through a command-line interface (CLI), requiring extensive knowledge of the file system operation, along with auxiliary tools such as LNet and the locking mechanisms. In addition, after the Lustre storage system is in place, maintaining the system and optimizing performance can be daunting. Such requirements, and the steep learning curve that is associated with them, might prevent organizations from experiencing the benefits of a parallel file system. Even for experienced Lustre system administrators, maintaining a Lustre file system can consume a significant amount of time.


Ready Solution overview


The Dell EMC Ready Solution for HPC Lustre Storage consists of a management server, Lustre MDS, Lustre OSSs, and the associated back-end storage. The solution provides storage that uses a single namespace that is easily accessed by the cluster's compute nodes and managed through the IML Web interface. The following figure shows the solution reference architecture with these primary components:

• Management server (IML)

• MDS pair with back-end storage (MDTs and MGT)

• OSS pair with back-end storage (OSTs)

Figure 2. Solution reference architecture

In Figure 2, the management server that runs IML 4.0.7.0 (the topmost server) is a PowerEdge R640. The MDS is a pair of PowerEdge R740 servers that are configured for high availability: active-active with DNE or active-passive without DNE. The MDS pair is attached through 12 Gb/s SAS links to a PowerVault ME4024, a 2U storage array, which hosts the MDTs.

The OSSs are a pair of PowerEdge R740 servers. This OSS pair is configured for active-active high availability. The OSS pair is attached through 12 Gb/s SAS links to four fully populated PowerVault ME4084 storage arrays with a choice of 4 TB, 8 TB, 10 TB, or 12 TB NL SAS 7.2 K RPM hard disk drives (HDDs). The four PowerVault ME4084 arrays host the OSTs for the Lustre file system.

For the data network, the solution supports the following HPC interconnects: Mellanox InfiniBand EDR and Intel Omni-Path.




Hardware and software configuration

The following table describes the hardware and software that we validated for the solution.

Table 1. Solution hardware and software specifications

Component Specification

IML server 1 x Dell EMC PowerEdge R640

MDS 2 x Dell EMC PowerEdge R740

OSS 2 x Dell EMC PowerEdge R740

Processors • IML server: Dual Intel Xeon Gold 5118 @ 2.3 GHz • MDS and OSS servers: Dual Intel Xeon Gold 6136 @ 3.00 GHz

Memory • IML server: 12 x 8 GB 2,666 MT/s DDR4 RDIMMs • MDS and OSS servers: 24 x 16 GiB 2,666 MT/s DDR4 RDIMMs

InfiniBand HCA Mellanox ConnectX-5 EDR PCIe adapter

External storage controllers • 2 x Dell 12 Gb/s SAS HBAs (on each MDS) • 4 x Dell 12 Gb/s SAS HBAs (on each OSS)

Object storage enclosures 4 x Dell EMC PowerVault ME4084 fully populated with a total of 336 drives; 2.69 PB raw storage capacity if equipped with 8 TB SAS drives

Metadata storage enclosure 1 x Dell EMC PowerVault ME4024 with: • 12 x 960 GB SAS SSD (no DNE) • 24 x 960 GB SAS SSD (DNE)

RAID controllers Duplex RAID in the ME4084 and ME4024 enclosures

HDDs • 84 x 8 TB 3.5 in. 7.2 K RPM NL SAS3 per ME4084 enclosure2 • 24 or 12 x 960 GB SAS3 SSDs per ME4024 enclosure

Operating system • CentOS 7.5 x86_64 • Red Hat Enterprise Linux (RHEL) 7.5 x86_64

Kernel version 3.10.0-862.el7.x86_64

BIOS version 1.4.5

Mellanox OFED version 4.4-1

Intel Omni-Path IFS version 10.8.0.0

Lustre file system version 2.10.4

IML version 4.0.7.0

The solution configuration guide, which can be obtained from your Dell EMC sales representative, provides more details about the configuration.

2 You can choose 4 TB, 8 TB, 10 TB, or 12 TB NL SAS3 7.2K RPM HDDs.



Management server

The single IML server is connected to the MDS pair and OSS pair through an internal 1 GbE network.

The management server is responsible for user interaction, systems health management, and basic monitoring; the data that it collects is provided through an interactive Web UI console, the IML. All management and administration access to the storage is through this server. While the management server manages the solution and collects data that is related to the Lustre file system, it does not play an active operational role in the Lustre file system or the data path itself.

The IML UI removes complexities from the installation process, minimizing time to deploy and configure the Lustre system. It also automates the monitoring of the health and performance of the solution components. The automation of the monitoring provides a better service to end users without increasing the burden for system administrators. In addition, the solution provides tools that help troubleshoot problems related to performance of the file system. Finally, the monitoring tool’s capability of keeping historical information enables better planning for expansion, maintenance, and upgrades of the storage appliance.

Metadata servers and targets

The two PowerEdge R740 servers that are used as the MDS are cabled for high availability. In the base configuration, each server is directly attached to a single PowerVault ME4024 storage array housing the Lustre MDTs and MGT. The MDS handles file and directory requests and routes tasks to the appropriate OSTs for fulfillment. In this solution, a single Mellanox InfiniBand EDR connection or Intel Omni-Path connection transmits storage requests across LNet. The server pair runs active/active high availability when there are two MDTs and active/passive when there is a single MDT.

Each MDS in the MDS pair is equipped with two dual-port 12 Gb/s SAS controllers or HBAs. These two servers are redundantly connected to the ME4024 array.

The following figure shows slot priorities for the SAS HBAs in slots 1 and 4 so that the SAS HBAs are evenly distributed across the two processors in the server for load balancing. The data network card, either Mellanox InfiniBand EDR or Intel Omni-Path, is installed in slot 8, which is a PCIe x16 slot.

Figure 3. MDS slot priority and ME4024 SAS ports

The following table shows the 12 Gb/s SAS cable connections between the MDS pair and one ME4024 that hosts the MDTs.



Table 2. MDS cabling

Server   SAS PCI slot   SAS port   ME4024 array   ME4024 controller   ME4024 controller port

lustre-mds1 Slot 1 Port 0 ME4024 #1 Controller 0 Port 3

lustre-mds1 Slot 4 Port 0 ME4024 #1 Controller 1 Port 1

lustre-mds2 Slot 1 Port 0 ME4024 #1 Controller 0 Port 1

lustre-mds2 Slot 4 Port 0 ME4024 #1 Controller 1 Port 3

In ME4 arrays, the number of drives that can be added to a linear RAID 10 disk group is restricted to a maximum of 16. Thus, the ME4024 that is used for metadata storage in this solution can be configured fully populated with 24 drives or half-populated with 12 drives.

Option 1: Fully populated array

An optimal way to make use of all 24 drives for metadata is to create two MDTs and use the DNE feature. With this option, we have two MDTs of equal capacity. Each MDT is a linear RAID 10 of 10 drives, and the remaining 4 drives in the array are configured as global hot spares across the two MDTs. One of the LUNs that hosts an MDT also hosts the 10 GB MGT volume. The following figure shows the 24-drive configuration.

Figure 4. 24-drive configuration: Fully populated ME4024 with two MDTs enabled for DNE

This fully populated ME4024 configuration is a good option if higher metadata performance is required and the cost of 24 SSDs in the array is not a major concern. By fully populating the ME4024 with 24 x 960 GB SAS SSDs and creating two MDTs with DNE, the Lustre file system can serve up to 4.68 billion file inodes.
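After the file system is mounted on a client, the available inode count per MDT and OST can be verified with the following command (a sketch, assuming a client mount point of /mnt/lustre):

lfs df -i /mnt/lustre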

Option 2: Half-populated array

The second option is a half-populated array with 12 drives. By populating only 50 percent of the array and creating a single MDT (no DNE), we have one linear RAID 10 of 10 drives and 2 drives as hot spares, as shown in the following figure.


Figure 5. 12-drive configuration: Half-populated ME4024 with one MDT

The half-populated ME4024 configuration is a good option to consider if DNE or higher metadata performance is not a priority. It can also be a good tradeoff between performance and the cost of fully populating the array. This option still achieves high metadata performance, although not as high as option 1. By populating the ME4024 with 12 x 960 GB SAS SSDs and creating a single MDT, the Lustre file system can serve up to 2.34 billion file inodes.

For more details about the metadata performance of both configuration options and a comparison between the two, see Performance evaluation.

Object storage servers and targets

The two PowerEdge R740 servers that are used as OSSs are arranged in a two-node high-availability cluster, providing active/active access to four PowerVault ME4084 high-density storage arrays. Each ME4084 array is fully populated with 84 x 3.5 in. 7.2 K RPM NL SAS HDDs, with capacity options of 4 TB, 8 TB, 10 TB, or 12 TB. We evaluated a configuration that used 8 TB drives, giving a raw storage capacity of 2,688 TB across a total of 336 drives.

The OSSs are the building blocks of the solution. With four dual-port 12 Gb/s SAS controllers in each PowerEdge R740, the two servers are redundantly connected to each of the four PowerVault ME4084 high-density storage arrays.

The following figure shows the OSS with slot priorities, where the SAS HBAs in slots 1, 2, 4, and 5 are evenly distributed across the two processors for load balancing. The InfiniBand EDR card or Omni-Path card is installed in slot 8, which is a PCIe x16 slot.



Figure 6. OSS slot priorities and ME4084 SAS ports

Note: To clearly display the SAS ports, Figure 6 shows one ME4084. This configuration has 4 x ME4084 arrays, as shown in Figure 2, Solution reference architecture.

The following table details the 12 Gb/s SAS cable connections between the OSS pair and four ME4084 arrays.

Table 3. Cabling for Object Storage

Server   SAS PCI slot   SAS port   ME4084 array   ME4084 controller   ME4084 controller port

lustre-oss1 Slot 1 Port 0 ME4084 #1 Controller 0 Port 3

lustre-oss1 Slot 1 Port 1 ME4084 #2 Controller 0 Port 3

lustre-oss1 Slot 2 Port 0 ME4084 #2 Controller 1 Port 3

lustre-oss1 Slot 2 Port 1 ME4084 #1 Controller 1 Port 3

lustre-oss2 Slot 1 Port 0 ME4084 #1 Controller 0 Port 1

lustre-oss2 Slot 1 Port 1 ME4084 #2 Controller 0 Port 1

lustre-oss2 Slot 2 Port 0 ME4084 #2 Controller 1 Port 1

lustre-oss2 Slot 2 Port 1 ME4084 #1 Controller 1 Port 1

lustre-oss1 Slot 4 Port 0 ME4084 #3 Controller 0 Port 3

lustre-oss1 Slot 4 Port 1 ME4084 #4 Controller 0 Port 3

lustre-oss1 Slot 5 Port 0 ME4084 #4 Controller 1 Port 3

lustre-oss1 Slot 5 Port 1 ME4084 #3 Controller 1 Port 3

lustre-oss2 Slot 4 Port 0 ME4084 #3 Controller 0 Port 1

lustre-oss2 Slot 4 Port 1 ME4084 #4 Controller 0 Port 1


lustre-oss2 Slot 5 Port 0 ME4084 #4 Controller 1 Port 1

lustre-oss2 Slot 5 Port 1 ME4084 #3 Controller 1 Port 1

Figure 7 illustrates how each storage array is divided into eight linear RAID 6 disk groups, with eight data and two parity disks per virtual disk.

Figure 7. RAID 6 (8+2) LUNs layout on one ME4084

Using RAID 6, the solution provides higher reliability at a marginal cost in write performance (due to the extra parity set that each RAID 6 write must update). Each OST provides about 64 TB of formatted object storage space when populated with 8 TB HDDs. Since each array has 84 drives, after creating eight RAID 6 disk groups we have 4 spare drives per array, 2 per tray, which are configured as global hot spares across the eight disk groups in the array. A single volume that uses all the space is created from each disk group. As a result, the base configuration shown in Figure 2 has a total of 32 linear RAID 6 volumes across the four ME4084 storage arrays. Each of these RAID 6 volumes is configured as an OST for the Lustre file system, resulting in a total of 32 OSTs across the file system in the base configuration.
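From any client, the resulting OST layout can be checked with the following command, which lists each of the 32 OSTs along with its size and usage (a sketch, assuming a client mount point of /mnt/lustre):

lfs df -h /mnt/lustre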

The OSTs are exposed to clients through LNet over Mellanox InfiniBand EDR or Intel Omni-Path connections. From any compute node that is equipped with the Lustre client, the entire namespace can be viewed and managed like any other file system, with the added benefits of Lustre management.

Deploying the OSSs in an active/active cluster configuration yields greater throughput and product reliability. This configuration provides high availability, consequently reducing potential downtime.



Scalability

The PowerEdge R740 server provides performance and density. The solution base configuration with 8 TB drives provides 2,688 TB of raw storage for each OSS pair and uses Mellanox InfiniBand EDR or Intel Omni-Path technology for very high-speed, low-latency storage transactions. The PowerEdge R740 takes advantage of the PCIe Gen3 x16 interface for the EDR or Omni-Path interconnect.

A Lustre client, version 2.10.4, for the RHEL 7.5 kernel with Mellanox OFED or Omni-Path HFI support is available for access to the Dell HPC Lustre Storage solution. The solution configuration guide, which can be obtained from your Dell EMC sales representative, provides details.

The solution can be scaled by adding OSS pairs with back-end storage, which not only increases system capacity but also increases throughput. The overall performance can be extrapolated from the base-configuration performance, which is described in Performance evaluation.

Networking

Solution networking includes the private management network and the data network through which clients access the Lustre file system.

Management network

The private management network provides a communication infrastructure for Lustre and its high-availability functionality, as well as for storage configuration, monitoring, and maintenance. This network creates the segmentation that is required to facilitate day-to-day operations and limit the scope of troubleshooting and maintenance. The management server uses this network to interact with the solution components to query and collect systems health information, and to perform any management changes that are initiated by administrators. Both OSSs and MDSs interact with the management server to provide health information and performance data, and to interact during management operations. The PowerVault ME4024 and ME4084 controllers are accessed through the out-of-band (Ethernet) ports to monitor storage array health and perform management actions on the back-end storage.

This level of integration enables even an inexperienced operator to efficiently and effortlessly monitor and administer the solution. The default information is summarized for quick inspection, but users can zoom in to detailed component and operating system messages from server and storage components.

Data network

The Lustre file system is served through an LNet implementation on either a Mellanox EDR or an Intel Omni-Path fabric, which is the network that clients use to access data. The IML UI provides an option to configure either a single Lustre network identifier (NID) or multiple NIDs on the metadata and object storage servers for participation in the Lustre network. For instance, you can configure your Mellanox EDR or Intel Omni-Path interface using IPoIB (ifcfg-ib0) and your 10 GbE interface using TCP (eth0) on your object storage servers, enabling both to serve the Lustre network.
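A multi-NID configuration like the one described above is commonly expressed as an LNet module option that maps each Lustre network to its interface. The following is a minimal sketch only, using assumed interface names; it is not taken from the solution's deployment scripts:

cat /etc/modprobe.d/lustre.conf
options lnet networks=o2ib0(ib0),tcp0(eth0)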

Mellanox EDR and Intel Omni-Path networks work at fast transfer speeds with low latency. An LNet implementation on Mellanox EDR uses the RDMA capabilities of the InfiniBand fabric to provide faster I/O transport and lower latency compared to typical networking protocols. An LNet implementation on Omni-Path uses the Performance Scaled Messaging (PSM) protocol for rapid data and metadata transfer between the MDTs and OSTs and the clients.

Integrated Manager for Lustre

Maintained by Whamcloud, IML is an intuitive Web UI that provides a view of the hardware, software, and file system components.3 It reduces the complexity that is involved with installation, configuration, and maintenance, including configuration and maintenance of high availability on the Lustre servers and LNet. Using IML, you can complete with a few mouse clicks many tasks that once required complex CLI instructions.

IML monitoring and management capabilities include initiating failover of the file system from one node to another (for either OSS or MDS), formatting the file system, mounting and unmounting the targets, and monitoring the performance of the Lustre file system and the status of its components.

While IML helps with maintenance and upgrades by enabling health monitoring of file system components, it also helps monitor file system performance by providing real-time performance charts.

The following figure shows the IML dashboard with charts that enable monitoring of file system performance and utilization. For example, the Read/Write Heat Map chart helps with monitoring bandwidth per OST. Similarly, you can monitor metadata performance by using the Metadata Operations chart, and you can monitor OST balance or utilization in terms of capacity by using the OST Balance chart. The charts provide guidance for adjusting file system parameters, so you can optimize the file system based on the monitoring of a specific workload on the system.

Figure 8. IML dashboard charts

3 http://wiki.Lustre.org/Integrated_Manager_for_Lustre



Performance evaluation


This section presents performance studies of the solution using Mellanox EDR and Intel Omni-Path data networks. The performance testing objectives were to quantify the capabilities of the solution, identify points of peak performance, and determine the most appropriate methods for scaling. We ran multiple performance studies, stressing the configuration with different types of workloads to determine the limitations of performance and define the sustainability of that performance.

We generally try to maintain a standard and consistent testing environment and methodology. As described in this section, in some areas we purposely optimized server or storage configurations and took measures to limit caching effects.

We performed the tests on the solution configuration that is shown in Table 1, Solution hardware and software specifications. The following table details the client test bed that we used to provide the I/O workload.

Table 4. Client cluster configuration

Component Specification

Operating system RHEL 7.5

Kernel version 3.10.0-862.2.3.el7.x86_64

Servers Dell EMC PowerEdge C6320

BIOS version 2.7.1

Mellanox OFED version/IFS version 4.4-1 (MOFED)/IFS 10.8.0.0

Lustre file system version 2.10.4

Number of physical nodes 32

Processors Intel Xeon CPU E5-2697 v4 @ 2.30 GHz

Memory 128 GB

Our performance analysis focused on these key performance markers:

• Throughput, data sequentially transferred in GB/s

• I/O operations per second (IOPS)

• Metadata operations per second (OP/s)

The goal was a broad but accurate review of the capabilities of the solution with Mellanox InfiniBand EDR and with Intel Omni-Path. To accomplish the goal, we used the IOzone and MDtest benchmarks. IOzone was configured to use the N-to-N file-access method, where every thread of the benchmark (N clients) writes to a different file (N files) on the storage system. For examples of the commands that we used to run these benchmarks, see Appendix A: Benchmark command reference.

We ran each set of tests on a range of clients to test the scalability of the solution. The number of simultaneous physical clients involved in each test ranged from a single client to 32 clients. Thread counts up to 32 correspond to the number of physical compute nodes, with one thread per node. Total thread counts above 32 were simulated by increasing the number of threads per client across all clients. For instance, for 128 threads, each of the 32 clients ran four threads.

To prevent inflated results due to caching effects, we ran the tests with a cold cache. Before each test started, the Lustre file system under test was remounted. A sync was performed, and the kernel was instructed to drop caches on all the clients and Lustre servers (MDS and OSS) with the following commands:

sync
echo 3 > /proc/sys/vm/drop_caches
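One way to issue these commands across a set of nodes at once is with a parallel shell such as pdsh. This is only a sketch, assuming pdsh is installed and that the hosts follow a naming pattern like node[01-32]:

pdsh -w node[01-32] "sync; echo 3 > /proc/sys/vm/drop_caches"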

In measuring the solution performance, we performed all tests with similar initial conditions. The file system was configured to be fully functional and the targets tested were emptied of files and directories before each test.

Sequential reads and writes N-to-N

To evaluate sequential reads and writes, we used the IOzone benchmark version 3.465 in the sequential read and write mode. We conducted the tests on multiple thread counts, starting at 1 thread and increasing in powers of 2 up to 256 threads. Because this test works on one file per thread, at each thread count a number of files equal to the thread count was generated. The threads were distributed across the 32 physical client nodes in a round-robin fashion.

We converted throughput results to GB/s from the KB/s metrics that were provided by the tool. We selected an aggregate file size of 2 TB, which was equally divided among the number of threads within any given test. We chose the aggregate file size so that it was large enough to minimize the effects of caching from the servers and Lustre clients. Operating system caches were also dropped or cleaned on the client nodes between tests and iterations and between writes and reads.

For all these tests, we used a Lustre stripe size of 1 MB. The stripe count was 1 for thread counts greater than or equal to 32. For thread counts less than 32, the files were striped across all 32 OSTs with a stripe count of 32. The files that were written were distributed evenly across the OSTs (round-robin) to prevent uneven I/O loads on any single SAS connection or OST, in the same way that a user would be expected to balance a workload.
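As an illustration, a layout like this can be applied per test directory with lfs setstripe before the files are written. The directory names below are hypothetical and only show the general form, not the exact commands used in this study:

lfs setstripe -S 1M -c 32 /mnt/lustre/seq_low_threads
lfs setstripe -S 1M -c 1 /mnt/lustre/seq_high_threads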

We used the same test methodology when testing the solution with the Mellanox InfiniBand EDR and the Intel Omni-Path data network.

The following figure shows sequential N-N performance of the solution with Mellanox EDR.



Figure 9. Sequential N-N read and write with EDR using IOzone: 2 TB aggregate data size

As the figure shows, the peak throughput of the system was attained at 32 threads. The peak write was 21.27 GB/s and the peak read was 22.56 GB/s. The single-thread write performance was 622 MB/s and the read performance was 643 MB/s. The read and write performance scaled almost linearly with the thread count up to 32 threads, where the system attained its peak. Then the writes saturated as we scaled, and the reads dipped at 64 threads and saturated thereafter. Thus, the overall sustained performance of this configuration for reads and writes was approximately 20 GB/s.

The following figure shows sequential N-N performance of the solution with Intel Omni-Path.

Figure 10. Sequential N-N read and write with Omni-Path using IOzone: 2 TB aggregate data size


As Figure 10 shows, the peak throughput of the system with Omni-Path was attained at 32 threads. The peak write was 21.3 GB/s and the peak read was 22.6 GB/s. The single-thread write performance was 619 MB/s and the read performance was 625 MB/s. The read and write performance scaled almost linearly with the thread count up to 32 threads, where the system attained its peak. Then the writes saturated as we scaled, and the reads dipped until 128 threads and saturated thereafter close to 20 GB/s. Thus, the overall sustained performance of this configuration for reads and writes was approximately 20 GB/s.

In both Figure 9 and Figure 10, we also see that the reads were very close to or slightly lower than the writes as the thread count increased. We ran server-to-storage tests using OBDFilter Survey, a tool included in the Lustre distribution.4 The results from this test confirmed our understanding that the read performance of this configuration could be better if there were more threads than objects (N threads writing to M objects, with N > M), thereby presenting a higher queue depth or more I/O at the ME4 controllers.
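OBDFilter Survey is run locally on an OSS and drives I/O directly against the OSTs, bypassing the client network. The following is a representative invocation only; the parameter values are illustrative assumptions and are not the exact settings used for Figures 11 and 12:

size=16384 nobjlo=64 nobjhi=64 thrlo=64 thrhi=256 case=disk obdfilter-survey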

Figure 11 and Figure 12 are charts from the server-to-storage tests with 64 and 256 objects, respectively. As shown, the read and write throughputs were very close to each other, and at times the write throughput was higher than the read throughput. (The network was not involved in these tests.) However, as we provided more I/O and increased the number of threads beyond the number of objects, the read performance surpassed the write performance. This confirms the back-end storage behavior that was shown by IOzone.

Figure 11. OBDFilter Survey, effect of thread count on throughput: 64 objects

4 http://wiki.Lustre.org/OBDFilter_Survey


Figure 12. OBDFilter Survey, effect of thread count on throughput: 256 objects

Random reads and writes N-to-N

To evaluate random I/O performance, we used IOzone version 3.465 in the random mode. We conducted the tests on multiple thread counts, starting at 16 threads and increasing in powers of 2 up to 256 threads. To minimize caching effects, we chose an aggregate file size of 1 TB across all thread counts, which was equally divided among all threads in any given test. The IOzone host file was arranged to distribute the workload evenly across the compute nodes. We used a Lustre stripe count of 1 and a stripe size of 4 MB. We used a 4 KB request size because it aligns with the Lustre file system's 4 KB block size and is representative of small block accesses for a random workload. Performance was measured in IOPS. The operating system caches were dropped between the runs on the Lustre servers and Lustre clients.

The following figure shows random read and write performance with Mellanox EDR. The random writes peaked at 256 threads with 28.53K IOPS, increasing slowly after 32 threads, while random reads grew steadily as the number of threads increased, peaking at 256 threads with 34.06K IOPS. The IOPS of random reads increased rapidly from 32 to 256 threads, although the write IOPS were better than the read IOPS for thread counts less than 256.



Figure 13. Random N-N read and write with EDR using IOzone: 1 TB aggregate data size

The following figure shows random read and write performance with Omni-Path. The random writes peaked at 256 threads with 26.7K IOPS, increasing slowly after 32 threads, while random reads grew steadily as the number of threads increased, peaking at 256 threads with 34.5K IOPS. The IOPS of random reads increased rapidly from 32 to 256 threads, although the write IOPS were better than the read IOPS for thread counts less than 256.

Figure 14. Random N-N read and write with Omni-Path using IOzone: 1 TB aggregate data size

We analyzed the storage array logs for the random I/O cases and determined that the ME4 array behavior for this kind of workload was as expected. The ME4 arrays are optimized with a higher queue depth at the controller level for both reads and writes, which means that performance increases as the number of simultaneous I/O requests increases.

As for writes, because write-back cache is used, the caching at the controller enables a higher number of commands to be queued, resulting in more write operations in flight than reads. This is likely one of the factors that causes writes to outperform reads at lower thread counts. However, as the queues get larger, the read-modify-write nature of the RAID 6 writes reaches a point where the writes stop increasing before the reads do, which is why the reads improve by a larger factor than the writes as we scale.

Metadata performance study

Metadata testing measures the time to complete certain file or directory operations that return attributes. MDtest is an MPI-coordinated benchmark that performs create, stat, and remove operations on files or directories. The metric reported by MDtest is the rate of completion in terms of OP/s. MDtest can be configured to compare metadata performance for directories and files; however, due to time constraints in testing, we performed only passes of file operations. To evaluate the metadata performance of the system, we used MDtest version 1.9.3 with the Intel MPI distribution.

While using DNE with two MDTs, we used directory striping and configured the distribution of subdirectories within the parent directory in a round-robin fashion. After we completed these tests, we tested a single-MDT case without DNE. This section includes a comparison of metadata performance results for a single MDT and for two MDTs with directory striping. We used the 32 Lustre clients that are described in Table 4 for the metadata tests as well.

To understand how well the system scales and to compare the different thread cases on similar ground, we tested from a single-thread case up to a 1,024-thread case with a consistent 2,097,152 file count for each case. The following table details the number of files per directory and the number of directories per thread for every thread count. We used the same number of files per directory and directories per thread, and the same testing methodology for both the two-MDT and single-MDT test cases. We ran three iterations of each test and recorded the mean values.

Table 5. MDtest files and directory distribution across threads

Number of threads   Number of files per directory   Number of directories per thread   Total number of files

1 1,024 2,048 2,097,152

2 1,024 1,024 2,097,152

4 1,024 512 2,097,152

8 1,024 256 2,097,152

16 1,024 128 2,097,152

32 1,024 64 2,097,152

64 1,024 32 2,097,152

128 1,024 16 2,097,152



256 1,024 8 2,097,152

512 1,024 4 2,097,152

1,024 1,024 2 2,097,152

We used the following commands to stripe the directories across the MDTs:

lfs mkdir -c 2 /mnt/Lustre/metadatatests
lfs setdirstripe -D -c 2 /mnt/Lustrefs/metadatatests

The -D option ensures that the distribution of subdirectories is round-robin, and the -c value is the total number of MDTs across which the directories are to be striped.
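The resulting layout can be confirmed with lfs getdirstripe, which reports the MDT stripe count and hash type of a directory (shown here against the same test directory):

lfs getdirstripe /mnt/Lustre/metadatatests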

The following figure shows file metadata statistics for the case of two MDTs with EDR. The peak file create rate was approximately 90K OP/s, file stat was approximately 1,237K OP/s, and file remove was approximately 596K OP/s. All three operations behaved similarly as we scaled from 1 thread to 1,024 threads. The file stat operations, the lightest metadata operations, scaled better than the remove and create operations; file create and file remove were slower than stat because the OSTs are also involved in those operations.

Figure 15. File metadata test with EDR using MDtest: 2M files on two MDTs with DNE


The following figure shows the results of metadata performance on a single MDT with no DNE. The file create operations peaked at approximately 60K OP/s, file stat operations at approximately 668K OP/s, and file remove operations at approximately 239K OP/s.

Figure 16. File metadata test with EDR using MDtest: 2M files on single MDT/No DNE

The following figure shows a comparison of file create operations between a single MDT and two MDTs with DNE. We can see that both scaled similarly. However, the file create operations with two MDTs peaked at approximately 90K OP/s, whereas with a single MDT they peaked at approximately 60K OP/s. The create operations improved by 50 percent with two MDTs compared to a single MDT because of the directory striping feature.

Figure 17. File create with EDR: Single MDT versus two MDTs with DNE


The following figure shows a comparison of file stat operations between a single MDT and two MDTs, with EDR. Both cases showed similar performance until reaching 64 threads, and the delta increased as we scaled further. The peak of file stat with a single MDT was approximately 668K OP/s, whereas the peak with two MDTs was 1,237K OP/s. File stat operations improved by 85.2 percent with two MDTs compared to a single MDT.

Figure 18. File stat with EDR: Single MDT versus two MDTs with DNE

The following figure shows a comparison of file remove operations between a single MDT and two MDTs, with Mellanox InfiniBand EDR. We see a difference between the two as we scaled from 1 thread to 1,024 threads, and the delta is more evident starting at 16 threads. The peak for a single MDT occurred at approximately 239K OP/s, whereas for two MDTs with directory striping, the peak occurred at approximately 596K OP/s at 1,024 threads. File remove operations improved by 149.5 percent with two MDTs with DNE compared to a single MDT.

Figure 19. File remove with EDR: Single MDT versus two MDTs with DNE


The following figure shows a comparison of file create, file stat, and file remove operations between a single MDT and two MDTs, with Omni-Path. We see a difference between the two as we scaled from 1 thread to 1,024 threads, and the delta is more evident starting at 16 threads. File stat operations with two MDTs with DNE peaked at 944.018K OP/s, which is 90.08 percent better than with a single MDT, where file stat operations peaked at approximately 496.637K OP/s. Similarly, file remove operations with two MDTs with DNE peaked at 470.595K OP/s, which is 99.13 percent better than with a single MDT, where the peak was at 236.323K OP/s. File create operations with two MDTs with DNE peaked at 147.858K OP/s, which is 60.92 percent better than with a single MDT, where file create operations peaked at 91.882K OP/s.

Figure 20. File metadata operations with Omni-Path using MDtest (2M files): Two MDTs with DNE versus single MDT


Performance tuning parameters


Introduction

You can configure multiple tuning parameters to achieve optimal system performance depending on intended workload patterns. This section shows the tuning parameters that we configured on the Lustre testbed system in the Dell HPC Engineering Innovations lab.

Tuning the solution with Mellanox EDR

The following table details the Lustre client tuning parameters that we configured to test the solution with the Mellanox InfiniBand EDR connection.

Table 6. Lustre client tuning with Mellanox InfiniBand EDR connection

Parameter setting Description

Lustre client tuning used for sequential I/O

lctl set_param osc.*.checksums=0 Disables checksums and associated overhead.

lctl set_param osc.*OST*.max_rpcs_in_flight=256 Sets the maximum number of concurrent RPCs in flight to the OST. In most cases, this setting helps with small file I/O patterns.

lctl set_param osc.*OST*.max_dirty_mb=1024 Sets how much dirty cache per OST can be written and queued. This setting generally benefits large memory I/O workloads.

lctl set_param osc.*.max_pages_per_rpc=1024 Sets the maximum number of pages to undergo I/O in a single RPC to that OST.

lctl set_param llite.*.max_read_ahead_mb=1024 Sets the maximum amount of data read-ahead, in an RPC-sized chunk, on a file.

lctl set_param llite.*.max_read_ahead_per_file_mb=1024 Sets the maximum amount of data read-ahead per file in MB. This setting benefits sequential reads.

Lustre client tuning used for random I/O

lctl set_param osc.*OST*.max_rpcs_in_flight=256 Sets the maximum number of concurrent RPCs in flight to the OST. In most cases, this parameter helps with small file I/O patterns.

lctl set_param osc.*.max_pages_per_rpc=1024 Sets the maximum number of pages to undergo I/O in a single RPC to that OST.
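To confirm that client-side settings such as those in the preceding table are in effect, the current values can be read back with lctl get_param. This is a quick verification sketch using parameter names from the table above:

lctl get_param osc.*OST*.max_rpcs_in_flight osc.*.max_pages_per_rpc llite.*.max_read_ahead_mb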

Tuning the solution with Omni-Path

This section includes details about the following parameters that we configured to test the solution with Intel Omni-Path:

• Lustre client tuning parameters

• Omni-Path Host Fabric Interface (HFI) tuning parameters

Lustre client tuning

The following table details the Lustre client tuning parameters that we configured to test the solution with the Omni-Path connection.



Table 7. Lustre client tuning with Omni-Path connection

Parameter setting Description

Lustre client tuning used for sequential I/O

lctl set_param osc.*.checksums=0 Disables checksums and associated overhead.

lctl set_param osc.*OST*.max_dirty_mb=1024 Sets how much dirty cache per OST can be written and queued. This setting generally benefits large memory I/O workloads.

lctl set_param osc.*.max_pages_per_rpc=1024 Sets the maximum number of pages to undergo I/O in a single RPC to that OST.

lctl set_param llite.*.max_read_ahead_mb=1024 Sets the maximum amount of data read-ahead, in an RPC-sized chunk, on a file.

lctl set_param llite.*.max_read_ahead_per_file_mb=1024 Sets the maximum amount of data read-ahead per file in MB. This setting benefits sequential reads.

Lustre client tuning used for random I/O

lctl set_param osc.*OST*.max_rpcs_in_flight=256 Sets the maximum number of concurrent RPCs in flight to the OST. In most cases, this parameter helps with small file I/O patterns.

lctl set_param osc.*.max_pages_per_rpc=1024 Sets the maximum number of pages to undergo I/O in a single RPC to that OST.

Omni-Path HFI tuning parameters

We set the HFI driver module parameters as follows:

Client nodes:

cat /etc/modprobe.d/hfi1.conf
options hfi1 pcie_caps=0x51 krcvqs=3 cap_mask=0x4c09a01cbba rcvhdrcnt=8192

Lustre server nodes:

cat /etc/modprobe.d/hfi1.conf
options hfi1 pcie_caps=0x51 krcvqs=5 cap_mask=0x4c09a01cbba rcvhdrcnt=8192

Setting network MTU

For our testing with both Mellanox EDR and Intel Omni-Path networks, we set the network MTU to 65520 on the servers and clients:

[root@Lustre-oss1 ~]# cat /etc/sysconfig/network-scripts/ifcfg-ib0
DEVICE=ib0
BOOTPROTO=static
…
ONBOOT=yes
CONNECTED_MODE=yes
MTU=65520
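A quick way to confirm that the MTU change took effect after the interface is restarted is to inspect the link (a sketch; the interface name ib0 matches the example above):

ip link show ib0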



Conclusion


The Dell EMC Ready Solution for HPC Lustre Storage is a high-performance clustered file system solution that is easy to manage, fully supported, and capable of scaling in throughput and in capacity. The solution includes the PowerEdge 14th generation server platform, PowerVault ME4 storage products, and Lustre technology, the leading open-source solution for a parallel file system. The Integrated Manager for Lustre (IML) unifies the management of the Lustre file system and solution components into a single control and monitoring panel for ease of use.

The solution has been validated and tested with two HPC data networks, Mellanox EDR and Intel Omni-Path. Both solution stacks have been shown to sustain sequential throughput of approximately 20 GB/s, which is consistent with the needs of HPC environments.


References


Dell EMC documentation

The following Dell EMC documentation provides additional and relevant information. Access to these documents depends on your login credentials. If you do not have access to a document, contact your Dell EMC representative.

• Dell EMC ME4 Series Storage System Administrator’s Guide

• Dell EMC Ready Solution for HPC Lustre Storage using PowerVault ME4 Storage Line

Lustre and Whamcloud documentation

The following Lustre and Whamcloud documentation provides additional and relevant information:

• Integrated Manager for Lustre

• OBDFilter Survey

• Introduction to Lustre Architecture

• Lustre.org Main Page (wiki)

• Whamcloud Community Portal



Appendix A: Benchmark command reference


This section describes the commands that were used to benchmark the Dell HPC Lustre Storage solution.

IOzone

We used the following commands to run the sequential and random IOzone tests, the results of which are recorded in Performance evaluation.

For sequential writes:

iozone -i 0 -c -e -w -r 1024K -I -s $Size -t $Thread -+n -+m /path/to/threadlist

For sequential reads:

iozone -i 1 -c -e -w -r 1024K -I -s $Size -t $Thread -+n -+m /path/to/threadlist

For IOPS random reads/writes:

iozone -i 2 -w -c -O -I -r 4K -s $Size -t $Thread -+n -+m /path/to/threadlist

The following table describes the IOzone command line options. The O_Direct command line option, -I, enables us to bypass the cache on the compute nodes where the IOzone threads are running.

Table 8. IOzone command line options

Option Description

-i 0 Write test

-i 1 Read test

-i 2 Random IOPS test

-+n No retest

-c Includes close in the timing calculations

-e Includes flush in the timing calculations

-r Record size

-s File size

-+m Location of clients to run IOzone on when in clustered mode

-I Use O_Direct

-w Does not unlink (delete) temporary file

-O Returns results in OPS



MDtest: Metadata file operations

We used the following command to run the metadata tests, the results of which are recorded in Performance evaluation:

mpirun -np $Threads -rr --hostfile /share/mdt_clients/mdtlist.$Threads /share/mdtest/mdtest.intel -v -d /mnt/Lustre/perf_test24-1M -i $Reps -b $Dirs -z 1 -L -I $Files -y -u -t -F

The following table describes the MDtest command line options.

Table 9. MDtest command line options

Option Description

-d Directory in which the tests will run

-v Verbosity (each instance of option increments by one)

-i Number of iterations that the test will run

-b Branching factor of hierarchical directory structure

-z Depth of hierarchical directory structure

-L Files only at leaf level of tree

-I Number of items per directory in tree

-y Sync file after writing

-u Unique working directory for each task

-t Time unique working directory overhead

-F Perform test on files only (no directories)

-D Perform test on directories only (no files)
