
Using GPFS to Manage NVRAM-Based Storage Cache

Salem El Sayed¹, Stephan Graf¹, Michael Hennecke², Dirk Pleiter¹, Georg Schwarz¹, Heiko Schick³, and Michael Stephan¹

¹ JSC, Forschungszentrum Jülich, 52425 Jülich, Germany
² IBM Deutschland GmbH, 40474 Düsseldorf, Germany
³ IBM Deutschland Research & Development GmbH, 71032 Böblingen, Germany

Abstract. I/O performance of large-scale HPC systems grows at a significantly slower rate than compute performance. In this article we investigate architectural options and technologies for a tiered storage system to mitigate this problem. Using GPFS and flash memory cards a prototype is implemented and evaluated. We compare performance numbers obtained by running synthetic benchmarks on a petascale Blue Gene/Q system connected to our prototype. Based on these results an assessment of the architecture and technology is performed.

    1 Introduction

For very large high-performance computing (HPC) systems it has become a major challenge to maintain a reasonable balance between compute performance and the performance of the I/O sub-system. In practice, this gap is growing and systems are moving away from Amdahl's rule of thumb for a balanced performance ratio, namely one bit of I/O per second for each instruction per second (see [1] for an updated version). Bandwidth is only one metric which describes the capability of an I/O sub-system. Additionally, capacity and access rates, i.e. the number of I/O requests per second which can be served, have to be taken into account. While today's technology allows high-capacity storage systems to be built, reaching high access rates is even more challenging than improving the bandwidth.

These technology trends are increasingly in conflict with application demands in computational sciences, in particular as these are becoming more I/O intensive [2]. With traditional technologies these trends are not expected to reverse. Therefore, the design of the I/O sub-system will be a key issue which needs to be addressed for future exascale architectures.

Today, storage systems attached to HPC facilities typically comprise an aggregation of spinning disks. These are hosted in external file servers which are connected via a commodity network to the compute system. This design has the advantage that it allows a huge number of disks to be integrated and ensures the availability of multiple data paths for resilience.

The extensive use of disks is largely driven by costs. Disk technology has improved dramatically over a long period of time in terms of capacity versus



Fig. 1. Read and write bandwidth averaged over 120 s as a function of time. The left pane shows the measurements on a single Blue Gene/P I/O node, while the right pane shows the results from all 600 I/O nodes of the JUGENE system.

costs. Bandwidth per disk and access rates, however, increase at a much slower rate (see, e.g., [3]) than compute performance. High bandwidth can thus only be achieved by increasing the number of disks within a single I/O sub-system, with the load being suitably distributed. For HPC systems the number of disks has meanwhile started to be determined mainly by bandwidth and not by capacity requirements. Scaling the number of disks is, however, limited by the cost and power budget as well as the exponentially increasing risk of failures and data corruption. Using today's disk technology to meet the exascale bandwidth requirement of 60 TByte/s [4] would exceed the currently mandated power budget of 20 MW, whereas satisfying the exascale bandwidth with flash memory and the exascale capacity with a disk tier is possible at an affordable power consumption.

To meet future demands it is therefore necessary to consider other non-volatile storage technologies and explore new designs for the I/O sub-system architecture. Promising opportunities arise from storage devices based on flash memory. Compared to disk technologies they feature order(s) of magnitude higher bandwidth and access rates. The main disadvantage is the poor capacity versus costs ratio, but device capacity is slowly increasing.

In this article we explore an architecture where we integrate flash memory devices into IBM's General Parallel File System (GPFS) to implement an intermediate layer between compute nodes and disk-based file servers. GPFS's Information Lifecycle Management (ILM) [5] is used to manage the tiered storage such that this additional complexity is hidden from the user. The key feature of GPFS which we exploit is the option to define groups for different kinds of storage devices, known as GPFS storage pools. Furthermore, the GPFS policy engine is used to manage the available storage and handle data movement between different storage pools.

This approach is motivated by the observation that applications running on HPC systems as they are operated at Jülich Supercomputing Centre (JSC) tend to start bursts of I/O operations.1 In Fig. 1 we show the read and write

    1 Such behaviour has also been reported elsewhere in the literature, see, e.g., [6].


bandwidth measured on JSC's petascale Blue Gene/P facility JUGENE. On each I/O node the bandwidth has been determined from the GPFS counters, which have been retrieved every 120 s.

If we assume an intermediate storage layer to be available which provides a much higher I/O bandwidth to the compute nodes (but an unchanged bandwidth towards the external storage), it is possible to reduce the time needed for I/O. Alternatively, it is also possible to lower the total power consumption by providing the original bandwidth through flash and reducing the bandwidth of the disk storage. This requires mechanisms to stage data kept in the external storage system before it is read by the application, as well as to cache data generated by the application before it is written to the external storage. All I/O operations related to staging are supposed to be executed asynchronously with respect to the application. The fast intermediate storage layer can also be used for out-of-core computations where data is temporarily moved out of main memory. Another use case is check-pointing. In both cases we assume the intermediate storage layer to be large enough to hold all data such that migration of the data to the external storage is avoided.

    The key contributions of this paper are:

1. We designed and implemented a tiered storage system where non-volatile flash memory is used to realize a fast cache layer and where resources and data transfer are managed by GPFS ILM features.

2. We provide results for synthetic benchmarks which were executed on a petascale Blue Gene/Q system connected to the small-scale prototype storage cluster. We compare results obtained using different flash memory cards.

3. Finally, we perform a performance and usability analysis for such a tiered storage architecture. We use I/O statistics collected on a Blue Gene/P system as input for a simple model for using the system as a fast write cache.

    2 Related Work

In recent years a number of papers have been published which investigate the use of fast storage technologies for staging I/O data, as well as different software architectures to manage such a tiered storage architecture.

In [7] part of the nodes' volatile main memory is used to create a temporary parallel file system. The performance of their RAMDISK nominally increases at the same rate as the number of nodes used by the application is increased. The authors implemented their own mechanisms to stage the data before a job starts and after the job has completed. Data staging is controlled by a scheduler which is coupled to the system's resource manager (here SLURM).

The DataStager architecture [8, 9] uses the local volatile memory to keep the buffers needed to implement asynchronous write operations. No file system is used to manage the buffer space; instead, a concept of data objects is introduced which are created under application control. After being notified, DataStager


processes running on separate nodes manage the data transport, i.e. data transport is server directed and data is pulled whenever sufficient resources are available.

To overcome the power and cost constraints of DRAM-based staging, the use of NVRAM is advocated in [10]. Here a future node design scenario is evaluated comprising Active NVRAM, i.e. NVRAM plus a low-power compute element. The latter allows for asynchronous processing of data chunks committed by the application. Final data may be flushed to external disk storage. Data transport and resource management are thus largely controlled by the application.

    3 Background

NAND flash memory belongs to the class of non-volatile memory technologies. Here we only consider Single-Level Cell (SLC) NAND flash, which features the highest number of write cycles. SLC flash chips currently have an endurance of up to O(100,000) erase/write cycles. At device level the problem of failing memory chips is significantly mitigated by wear-leveling and RAID mechanisms. A large number of write cycles is critical when using flash memory devices for HPC systems. The advantages of flash memory devices with respect to standard disks are significantly higher bandwidth and orders of magnitude higher I/O operation rates due to significantly lower access latencies, at much lower power consumption.

GPFS is a flexible, scalable parallel file system architecture which allows both data and metadata to be distributed. The smallest building block, a Network Shared Disk (NSD), which can be any storage device, is dedicated to data only, metadata only, or both. In this work we exploit several features of GPFS. First, we make use of storage pools, which allow storage elements to be grouped. Pools are traditionally used to handle different types or usage modes of disks (and tapes). In the context of tiered storage this feature can be used to group the flash storage into one pool and the slower disk storage into another pool. Second, a policy engine provides the means to let GPFS manage the placement of new files and the staging of files. This engine is controlled by a simple set of rules.

    4 GPFS-Based Staging

In this paper we investigate a setup where flash memory cards are integrated into a persistent GPFS instance. GPFS features are used (1) to steer initial file placement, (2) to manage migration from flash to external, disk-based storage, and (3) to stage files from external storage to flash memory. The user therefore continues to access a standard file system.

We organise external disk storage and flash memory into a disk and flash pool, respectively. Then we use the GPFS policy engine to manage automatic data staging. In Fig. 2 we show our policy rules with the parameters defined in Table 1. These rules control when the GPFS events listed in Table 1 are thrown; the numerical values should be set according to the operational characteristics of an


    RULE SET POOL flash LIMIT(fmax)
    RULE SET POOL disk

    RULE MIGRATE
        FROM POOL flash
        THRESHOLD(fstart,fstop)
        WEIGHT(CURRENT_TIMESTAMP - ACCESS_TIME)
        TO POOL disk

Fig. 2. The GPFS policy rules for the target file system define the placement and migration rules for the files' data blocks

HPC system's workload. Files are created in the storage pool flash as long as its filling does not exceed the threshold fmax. When the filling exceeds fmax the second rule applies, which allows files to be created in the pool disk. This fall-back mechanism is foreseen to avoid writes failing when the pool flash is full but there is free space in the storage pool disk.

The third rule manages the migration of data from the pool flash to disk. If the filling of the pool flash exceeds the threshold fstart the event lowDiskSpace is thrown and automatic data migration is initiated. Migration stops once the filling of the pool flash drops below the limit fstop (where fstop < fstart). For GPFS to decide which files to migrate first, a weight factor is assigned to each file, which we have chosen such that the least recently accessed files are migrated first. Migration is controlled by a callback function which is bound to the events lowDiskSpace and noDiskSpace. Whenever these events occur, the installed policy rules for the corresponding file system are re-evaluated. The GPFS parameter noSpaceEventInterval defines at which intervals the events lowDiskSpace and noDiskSpace may occur (it defaults to 120 seconds).
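As an illustration, a minimal sketch of such a callback script in Python is given below; the script name, its registration via mmaddcallback and the parameter handling are our own assumptions, not the exact setup used on the prototype:

    #!/usr/bin/env python
    # Hypothetical callback script: re-apply the installed policy rules when
    # GPFS raises a lowDiskSpace or noDiskSpace event for the file system.
    # Assumed registration along the lines of
    #   mmaddcallback migrateFlash --command /usr/local/bin/migrate_flash.py \
    #                 --event lowDiskSpace,noDiskSpace --parms "%fsName"
    import subprocess
    import sys

    def main():
        if len(sys.argv) < 2:
            sys.exit("usage: migrate_flash.py <fsName>")
        fs_name = sys.argv[1]
        # mmapplypolicy evaluates the MIGRATE rule of Fig. 2 and moves the
        # least recently accessed files from pool flash to pool disk.
        subprocess.check_call(
            ["/usr/lpp/mmfs/bin/mmapplypolicy", fs_name, "-I", "yes"])

    if __name__ == "__main__":
        main()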

Finally, to steer the staging of files to flash memory we propose the implementation of a prefetch mechanism which is triggered either by the user application or by the system's resource manager (as in [7]). Technically this can be realized by using the command mmchattr -P flash on the file in question. To avoid the file being automatically moved back to the pool disk, its last access time has to be updated, too. A possible way to achieve this is to create a library which offers a function like prefetch(). This library also has to ensure that only one rank of the application is in charge.
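A minimal sketch of such a prefetch helper follows, assuming mpi4py for the rank guard; the function name prefetch() follows the text above, while the command path and error handling are illustrative only:

    # Hypothetical prefetch helper: move a file's data blocks to the flash
    # pool and refresh its access time so that the migration rule does not
    # send it straight back to disk. Only rank 0 issues the GPFS command.
    import os
    import subprocess
    from mpi4py import MPI

    MMCHATTR = "/usr/lpp/mmfs/bin/mmchattr"  # assumed GPFS command path

    def prefetch(path, comm=MPI.COMM_WORLD):
        if comm.Get_rank() == 0:
            # Assign the file to the flash storage pool.
            subprocess.check_call([MMCHATTR, "-P", "flash", path])
            # Touch the access time so the weight
            # CURRENT_TIMESTAMP - ACCESS_TIME stays small.
            os.utime(path, None)
        comm.Barrier()  # all ranks wait until the file has been staged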

    Table 1. Policy rule parameters and the values used in this paper

Parameter  Description                                         Value  Event
fmax       Maximal filling (in percent)                        90     noDiskSpace
fstart     Filling limit which starts migration (in percent)   30     lowDiskSpace
fstop      Filling limit which stops migration (in percent)     5


The optimal choice of the parameters fstart, fstop and fmax depends on how the system is used. For larger values of fmax the probability of files being opened for writing in the slow disk pool decreases. Large values for fstart and fstop may cause migration to start too late or finish too early and thus increase the risk of the pool flash becoming full. On the other hand, small values would lead to early migration, which could adversely affect read performance in case the application tries to access the data for reading after migration.

    5 Test Setup

    5.1 Prototype Configuration

Our prototype I/O system JUNIORS (JUlich Novel IO Research System) consists of IBM x3650 M3 servers with two hex-core Intel Xeon X5650 CPUs (running at 2.67 GHz) and 48 GBytes of DDR3 main memory. For the tests reported here we use up to 6 nodes. Each node is equipped with 2 flash memory cards, either from Fusion-io or from Texas Memory Systems (TMS). The most important performance parameters of these devices as reported by the corresponding vendor are collected in Table 2. Note that these numbers only give an indication of the achievable performance. Each node is also equipped with 2 dual-port 10-GbE adapters. The number of ports has been chosen such that each node's nominal bandwidth to the flash memory and to the network is roughly balanced. Channel bonding is applied to reduce the number of logical network interfaces per node.

The Ethernet network connects the prototype I/O system to the petascale Blue Gene/Q system JUQUEEN and the petascale disk storage system JUST. To use the massively parallel compute system for generating load, we mount our experimental GPFS file system on the Blue Gene/Q I/O nodes. For an overview of the prototype system and the Ethernet interconnect see Fig. 3 (left pane).

On each node we installed the operating system RHEL 6.1 (64 bit) and GPFS version 3.5.0.7. For the Fusion-io ioDrive Duo we used the driver/utility software version 3.1.5 including firmware v7.0.0 revision 107322. For the TMS RamSan-70 cards we used the driver/utility software version 3.6.0.11.

To monitor the data flow within the system we designed a simple monitoring infrastructure consisting of different sensors, as shown in Fig. 3 (right pane). The tools netstat and iostat are used to sense the data flow between node and network as well as through the Linux block layer, respectively. Vendor-specific

    Table 2. Manufacturer hardware specification of the flash memory cards2

                                 Fusion-io ioDrive Duo SLC   TMS RamSan-70
Capacity                         320 GByte                   450 GByte
Read/write bandwidth [GByte/s]   1.5 / 1.5                   1.25 / 0.9
Read/write IOPS                  261,000 / 262,000           300,000 / 220,000

    2 For the RamSan-70 we report the performance numbers for 4 kBytes block size.


Fig. 3. The left pane shows a schematic view of the I/O prototype system and the network through which it is connected to a compute system as well as a storage system. The right pane illustrates the data flow monitoring infrastructure.

tools are used to sense the amount of data written to (or read from) the flash memory device. GPFS statistics are collected using the mmpmon facility.
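To illustrate how these sensors can be combined, a minimal polling sketch is given below; the command paths, the choice of mmpmon request (io_s) and the 120 s interval are assumptions, and parsing of the returned counters is omitted:

    # Minimal polling loop for the data-flow sensors described above.
    # Assumes mmpmon (GPFS) and iostat (sysstat) are installed; the output
    # is printed raw, a real monitor would parse and store the counters.
    import subprocess
    import time

    MMPMON = "/usr/lpp/mmfs/bin/mmpmon"
    INTERVAL = 120  # seconds, matching the averaging window used in Fig. 1

    def sample():
        # 'io_s' asks mmpmon for cumulative GPFS I/O statistics of this node;
        # '-p' requests machine-parseable output.
        gpfs = subprocess.run([MMPMON, "-p"], input="io_s\n",
                              capture_output=True, text=True).stdout
        # One iostat report for the block devices (Linux block layer).
        block = subprocess.run(["iostat", "-d", "-k"],
                               capture_output=True, text=True).stdout
        return gpfs, block

    if __name__ == "__main__":
        while True:
            gpfs, block = sample()
            print(time.strftime("%F %T"), gpfs, block, sep="\n")
            time.sleep(INTERVAL)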

    5.2 Benchmark Definitions

To test the performance we used three different benchmarks. For network bandwidth measurements we used the micro-benchmark nsdperf, a tool which is part of GPFS. It allows a set of servers and clients to be defined and then generates the network traffic which real write and read operations would cause, without actually performing any disk I/O.

For sequential I/O bandwidth measurements we used the SIONlib [11] benchmark tool partest, which generates a load similar to what one would expect from HPC applications. These often perform task-local I/O where each job rank creates its own file. If the number of ranks is large then the time needed to handle file system metadata becomes large. This is a software effect which is not mitigated by using faster storage devices like flash memory cards. SIONlib is a parallel I/O library that addresses this problem by transparently mapping a large number of task-local files onto a small number of physical files. To benchmark our setup we ran this test repeatedly using 32,768 processes distributed over 512 Blue Gene/Q nodes. During each iteration 2 TBytes of data are written in total.

To measure bandwidth as well as I/O operation (IOP) rates we furthermore use the synthetic benchmark tool IOR [12] to generate a parallel I/O workload. It is highly configurable and supports various interfaces and access patterns and thus allows the I/O workload of different applications to be mimicked. We used 512 Blue Gene/Q nodes with one task per node, which all performed random I/O operations on task-local files opened via the POSIX interface, with the flag O_DIRECT set to minimize caching effects. In total 128 GBytes are written and read using transfers of size 4 kBytes.


Fig. 4. Read and write bandwidth as a function of time for the SIONlib benchmark using flash memory cards from Fusion-io (left) or TMS (right). Both figures show the I/O monitoring of one JUNIORS node (2 flash cards).

    6 Evaluation

We start the performance evaluation by testing the network bandwidth between the JUQUEEN system, where the load is generated, and the prototype I/O system JUNIORS using nsdperf. For these tests we used 16 Blue Gene/Q I/O nodes as clients and 2 JUNIORS nodes as servers. We found a bandwidth of 8.1 GByte/s for emulated writes and 9.7 GByte/s for reads. Comparing these results with the nominal parameters of the flash cards listed in Table 2 indicates that our network provides sufficient bandwidth to balance the bandwidth of 2 flash cards per node.

For the following benchmarks each JUNIORS server was configured as an NSD server within a GPFS file system which was mounted by the JUQUEEN I/O nodes. Using 16 clients we measured the I/O bandwidth for sequential access using partest. The observed read bandwidth is 12.5 GByte/s using 4 JUNIORS servers and 8 Fusion-io flash cards, slightly more than we would expect from the vendor's specifications. However, a significantly different behaviour is observed for writing, where we observe a performance drop of more than 40%, from 6.5 to 3.7 GByte/s, after a short period of writing. To investigate the cause of this behaviour we analyse the data flow information from our monitoring system. The sensor values collected on 1 out of 4 nodes are plotted in Fig. 4 (left pane). The benchmark first performs a write and then a read cycle. We first notice that the amount of data passing the network device is consistent with the amount of data transferred over the operating system's block device layer. At the beginning of the initial write cycle the amount of data received via the network agrees with the amount of data written to the flash card. However, after 67 s of writing we observe a drop in the bandwidth of received data while at the same time the flash card reports read operations. We observe that the amount of data passing the block device layer towards the processor remains zero. Furthermore, the amount of data written to the flash card agrees with the amount of data received via the network plus the amount of data read from the flash cards. We therefore conclude


that the drop in write performance is caused by read and write operations from and to the flash card initiated by the flash card driver.

Let us now consider the performance obtained when using the flash cards from TMS. Here we used 2 JUNIORS servers with 2 flash cards each. We observe a read (write) bandwidth of 5.7 (3.2) GByte/s. This corresponds to 114% (90%) of the read (write) bandwidth one would naively expect from the vendor's performance specification. As can be seen from Fig. 4, the write bandwidth is sustained during the whole test.

Next, we evaluated the I/O operation (IOPS) rate which we can obtain on our prototype for a random access pattern generated by IOR. We again used the setup with 2 JUNIORS servers and 4 TMS cards. For comparison we repeated the benchmark run using the large-capacity storage system JUST, where about 5,000 disks are aggregated into a single GPFS file system. In Table 3 the mean IOPS values for read and write are listed. We observe the IOPS rate on our prototype I/O system to be significantly higher than on the standard, disk-based storage system. For a fair comparison it has to be taken into account that the storage system JUST has been optimized for streaming I/O, with disks organized into RAID6 arrays with large stripes of 4 MBytes. Furthermore, the prototype and the compute system are slightly closer in terms of network hops. The results on the prototype are an order of magnitude smaller than one would naively expect from the vendor's specification (see Table 2). These numbers we could only reproduce when performing local raw flash device access without a file system.

In the final step we defined two storage pools in the GPFS file system. For the pool flash we used 4 TMS cards in 2 JUNIORS nodes. The pool disk was implemented with 6 Fusion-io cards in 3 other servers, because there were no disks in the JUNIORS cluster. These pools were all configured as dataOnly, while a third pool on yet another flash disk is used for metadata. The GPFS placement and migration rules are defined as shown in Fig. 2 and Table 1. To evaluate the setup we use the SIONlib benchmark partest, which uses POSIX I/O routines, as well as the IOR benchmark, which we configured to use MPI-IO, to create 1.5 TBytes of data.

In Fig. 5 we show the data throughput at the different sensors for the different benchmarks as a function of time. For the two different benchmarks no significant differences are observed. We obtained similar results for IOR using the POSIX instead of the MPI-IO interface. The SIONlib benchmark starts with a write cycle followed by a read cycle, which here starts and ends about 500 s and 800 s after the start of the test, respectively. Initially all data is placed in the pool flash. After 260 s GPFS started the migration process. There is only a small impact

Table 3. IOPS comparison between two JUNIORS nodes using 4 TMS RamSan-70 cards and the classical scratch file system using about 5,000 disks

Storage system   Write IOPS (mean)   Read IOPS (mean)
JUNIORS          52234               122123
JUST             20456               69712


Fig. 5. Read and write bandwidth (per node) as a function of time for the SIONlib benchmark using the POSIX interface (left) and the IOR benchmark using the MPI-IO interface (right)

seen on the I/O performance. After the benchmark run ended it took up to 17 minutes to finish the migration.

    7 Performance Analysis

To assess the potential of this architecture being used exclusively as a write cache, it is instructive to consider a simple model. Let us denote the time required to perform the computations and the (synchronous) I/O operations by t_comp and t_io, respectively, and assume t_comp > t_io. While the application is computing, data can be staged between the disk and flash pools. Therefore, it is reasonable to choose the bandwidth between the compute system and the staging area y = t_comp/t_io times larger than between the staging area and the external storage. As a result the time for I/O should reduce by a factor 1/y, and therefore the overall execution time should reduce by a factor (t_comp + t_io/y) / (t_comp + t_io) = (y + 1/y) / (y + 1). In this simple model the performance gain for the overall system would be up to 17% for y = 1 + √2. Since the storage subsystem accounts for only a fraction of the overall costs, this could significantly improve overall cost efficiency.
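To make the arithmetic explicit, the following small sketch (purely illustrative) evaluates the reduction factor (y + 1/y)/(y + 1) and confirms the optimum at y = 1 + √2:

    # Reduction of the total execution time when I/O is accelerated by a
    # factor y while the computation time stays fixed (t_comp = y * t_io).
    import math

    def reduction(y):
        # (t_comp + t_io / y) / (t_comp + t_io) with t_comp = y * t_io
        return (y + 1.0 / y) / (y + 1.0)

    y_opt = 1.0 + math.sqrt(2.0)
    print(f"optimal y = {y_opt:.3f}, reduction factor = {reduction(y_opt):.3f}")
    # -> reduction factor of about 0.83, i.e. a gain of roughly 17%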

To investigate this further we implemented a simple simulation model, as shown in Fig. 6. The model mimics the GPFS policy rules used for the JUNIORS prototype. The behaviour is controlled by the policy parameters fstart, fstop and fmax as well as a set of bandwidth parameters. For each data path a different bandwidth parameter is foreseen. The bandwidth along the data path connecting the processing device and the pool flash, as well as between the pools flash and disk, may change when data migration is started, as is observed for our prototype. All bandwidth parameters are chosen such that the performance of our prototype is resembled. Note that the bandwidth is assumed not to depend on I/O patterns. This is a simplification, as the sustainable bandwidth of a disk-based system depends heavily on the I/O request sizes. One big advantage of flash storage is that it can sustain close to peak bandwidth for much smaller I/O request sizes (and file system block sizes).


Fig. 6. Schematic view of the simulation model consisting of a processing device, a storage manager and 2 storage pools, flash and disk

From the JUGENE I/O statistics (see Fig. 1) we extract the amount of data written and the time between write operations. With the additional pool flash disabled, the execution times obtained from the model and the real execution times agree within 10%. When enabling the write cache, the simulated execution time reduces at the 1% level. This result is consistent with the above model analysis, as for the monitored time period t_comp/t_io ≈ 300, i.e. the system was not used by applications which are I/O bound. This we consider typical for current HPC systems, since the performance penalty when executing a significant amount of I/O operations is large.
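For illustration, a minimal discrete-time sketch of such a simulator is given below; all names, the time step and the example parameters are our own and do not correspond to the calibrated bandwidth values used for the results above:

    # Toy write-cache simulator mimicking the policy rules of Fig. 2:
    # writes land in the flash pool, migration to disk starts at fstart
    # and stops at fstop, and writes fall back to disk when flash is full.
    def simulate(writes, flash_capacity, bw_flash, bw_disk, bw_migrate,
                 fstart=0.30, fstop=0.05, fmax=0.90, dt=1.0):
        """writes: list of (start_time, bytes) bursts; returns the time at
        which the application has written all of its data."""
        t, filled, migrating = 0.0, 0.0, False
        pending = sorted(writes)  # bursts not yet started
        backlog = 0.0             # bytes still to be written by the app
        while pending or backlog > 0.0:
            # release bursts whose start time has passed
            while pending and pending[0][0] <= t:
                backlog += pending.pop(0)[1]
            # application writes into flash, or directly to disk if full
            if backlog > 0.0:
                use_flash = filled < fmax * flash_capacity
                bw = bw_flash if use_flash else bw_disk
                written = min(backlog, bw * dt)
                backlog -= written
                if use_flash:
                    filled += written
            # policy engine: thresholds with hysteresis (fstart > fstop)
            if filled >= fstart * flash_capacity:
                migrating = True
            elif filled <= fstop * flash_capacity:
                migrating = False
            if migrating:
                filled -= min(filled, bw_migrate * dt)
            t += dt
        return t

    # Example: two 100 GB bursts, one hour apart, on a 450 GB flash pool.
    print(simulate([(0, 100e9), (3600, 100e9)],
                   flash_capacity=450e9, bw_flash=5e9,
                   bw_disk=0.5e9, bw_migrate=0.5e9))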

    8 Discussion and Conclusions

In this paper we evaluated the functionality and performance of an I/O prototype system comprising flash memory cards in addition to a disk pool. We could demonstrate that the system is capable of sustaining high write and read bandwidth to and from the flash cards, using a massively parallel Blue Gene/Q system to generate the load. Taking into account that in standard user operation mode I/O operations occur in bursts, we investigated how such an I/O architecture could be used to realise a tiered storage system where flash memory is used for staging data. We demonstrated how such a system can be managed using the policy rule mechanism of GPFS.

Our results indicate that for loads extracted from current I/O statistics some gain is to be expected just by using the architecture as a write cache. The statistics are, however, biased, as I/O-bound applications hardly use the system for performance reasons. Transferring data sequentially using large blocks does not meet the requirements of many scientific applications. A performance assessment of the proposed prefetching mechanism from disk to flash could not be carried out within the scope of this paper. It requires extensions to workflow managers and other software tools which make it easy for the user to provide information on which files will be accessed for reading before or while executing a job (see RAMDISK [7] for a possible solution).

The considered hierarchical storage system comprising non-volatile memory has an even higher potential for performance improvements when the application's I/O performance is mainly limited by the rate at which read and write requests can be performed. For our prototype system we show that a high IOPS


rate can be achieved. We therefore expect our tiered storage system to be particularly efficient when used, e.g., for multi-pass analysis applications performing a large number of small random I/O operations.

Acknowledgements. Results presented in this paper are partially based on the project MPP Exascale system I/O concepts, a joint project of IBM, CSCS and Jülich Supercomputing Centre (JSC) which was partly supported by the EU grant 261557 (PRACE-1IP). We would like to thank all members of this project, in particular H. El-Harake, W. Homberg and O. Mextorf, for their contributions.

    References

[1] Gray, J., Shenoy, P.: Rules of Thumb in Data Engineering, pp. 3-10 (2000)
[2] Bell, G., Gray, J., Szalay, A.: Petascale computational systems. Computer 39(1), 110-112 (2006)
[3] Hitachi, https://www1.hgst.com/hdd/technolo/overview/storagetechchart.html (accessed: January 26, 2013)
[4] Stevens, R., White, A., et al. (2010), http://www.exascale.org/mediawiki/images/d/db/planningforexascaleapps-steven.pdf (accessed: January 26, 2013)
[5] Mueller-Wicke, D., Mueller, C.: TSM for Space Management for UNIX GPFS Integration (2010)
[6] Miller, E.L., Katz, R.H.: Input/output Behavior of Supercomputing Applications. In: SC 1991, pp. 567-576. ACM, New York (1991)
[7] Wickberg, T., Carothers, C.: The RAMDISK Storage Accelerator: a Method of Accelerating I/O Performance on HPC Systems using RAMDISKs. In: ROSS 2012, pp. 5:1-5:8. ACM, New York (2012)
[8] Abbasi, H., Wolf, M., Eisenhauer, G., Klasky, S., Schwan, K., Zheng, F.: DataStager: Scalable Data Staging Services for Petascale Applications. In: HPDC 2009, pp. 39-48. ACM, New York (2009)
[9] Abbasi, H., Wolf, M., Eisenhauer, G., Klasky, S., Schwan, K., Zheng, F.: DataStager: Scalable Data Staging Services for Petascale Applications. Cluster Computing 13(3), 277-290 (2010)
[10] Kannan, S., Gavrilovska, A., Schwan, K., Milojicic, D., Talwar, V.: Using active NVRAM for I/O staging. In: PDAC 2011, pp. 15-22. ACM, New York (2011)
[11] Frings, W., Wolf, F., Petkov, V.: Scalable Massively Parallel I/O to Task-local Files. In: SC 2009, pp. 17:1-17:11. ACM, New York (2009)
[12] Borrill, J., Oliker, L., Shalf, J., Shan, H.: Investigation of Leading HPC I/O Performance Using a Scientific-application Derived Benchmark. In: SC 2007, pp. 10:1-10:12. ACM, New York (2007)

IBM, Blue Gene and GPFS are trademarks of IBM in the USA and/or other countries. Linux is a registered trademark of Linus Torvalds in the USA, other countries, or both. RamSan and Texas Memory Systems are registered trademarks of Texas Memory Systems, an IBM Company. Fusion-io, ioDrive, ioDrive2 Duo and ioDrive Duo are trademarks or registered trademarks of Fusion-io, Inc.
