
Faodel: Data Management for Next-Generation Application Workflows

Craig Ulmer, Shyamali Mukherjee, Gary Templet

Sandia National Laboratories

Livermore, California

[email protected]

Todd Kordenbrock

DXC Technology

[email protected]

ABSTRACT

Composition of computational science applications, whether into ad hoc pipelines for analysis of simulation data or into well-defined and repeatable workflows, is becoming commonplace. In order to scale well as projected system and data sizes increase, developers will have to address a number of looming challenges. Increased contention for parallel filesystem bandwidth, accommodating in situ and ex situ processing, and the advent of decentralized programming models will all complicate application composition for next-generation systems. In this paper, we introduce a set of data services, Faodel, which provide scalable data management for workflows and composed applications. Faodel allows workflow components to directly and efficiently exchange data in semantically appropriate forms, rather than those dictated by the storage hierarchy or programming model in use. We describe the architecture of Faodel and present preliminary performance results demonstrating its potential for scalability in workflow scenarios.

KEYWORDS

Workflow, composition, data management, scalability

ACM Reference Format:

Craig Ulmer, Shyamali Mukherjee, Gary Templet, Scott Levy, Jay Lofstead, Patrick Widener, Todd Kordenbrock, and Margaret Lawson. 2018. Faodel: Data Management for Next-Generation Application Workflows. In Proceedings of Workshop on Infrastructure for Workflows and Application Composition, Phoenix, AZ, June 12, 2018 (IWAC 2018), 6 pages.

https://doi.org/10.1145/nnnnnnn.nnnnnnn

1 INTRODUCTION

In order to accelerate the extraction of information from data, workflows that couple computational science applications are proving

*Also with Dartmouth College.

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).

IWAC 2018, June 12, 2018, Phoenix, AZ
© 2018 Copyright held by the owner/author(s).
ACM ISBN 978-x-xxxx-xxxx-x/YY/MM.
https://doi.org/10.1145/nnnnnnn.nnnnnnn

Scott Levy, Jay Lofstead, Patrick Widener

Sandia National Laboratories

Albuquerque, New Mexico

[email protected]

Margaret Lawson*

Sandia National Laboratories

Albuquerque, New Mexico

[email protected]

increasingly useful. By allowing developers to reason about their computational problems in a modular and regular manner, such workflows trade initial complexity in their definition and construction for ease of modification and increased reuse once they are in place. In order to maintain workflow construction as a viable technique for computational science, that complexity must be managed as projected system and data sizes increase.

Several potential complications are clouding the horizon, however:

• Impedance mismatches between data generation rates and parallel filesystem bandwidth have been an issue in computational science for some time now. System architectures have addressed this with various refinements to the memory hierarchy, more recently involving the development of node-local "burst buffers". While these changes provide more options and flexibility for workflow designers, they also introduce added design considerations.

• Coupling simulation and analysis using in situ and ex situ techniques presents a different set of choices for workflow designers. Depending on system loads, problem inputs, or changing availability of specialized compute engines, it may be advantageous to migrate analytics tasks in and out of a particular application. While this may not make a difference in the end state of a workflow, adding this capability presents significant data management issues.

• While the bulk-synchronous-program (BSP) model remains dominant, emerging decentralized decomposition and scheduling models are being explored for their potential to provide dramatically increased scalability. Assumptions underlying the design of workflows for a group of cooperating BSP applications will not necessarily hold for workflows designed around asynchronous many-task (AMT) execution. For example, using a parallel filesystem to exchange data between workflow components will likely be problematic as the number of discrete application tasks increases.

Our group has been exploring how changing the ways in which cooperating applications exchange data can provide leverage on these kinds of concerns. This paper describes a set of services for data management and exchange, Faodel, for use in such applications. Our goal is to allow workflow components to exchange data

SAND2018-5192C

This paper describes objective technical results and analysis. Any subjective views or opinions that might be expressed in the paper do not necessarily represent the views of the U.S. Department of Energy or the United States Government.

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology & Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525.



in forms whose semantics are dictated by the problem being solved by the workflow rather than by considerations such as the storage hierarchy or programming model in use. This paper contributes a discussion of the design and architecture of Faodel, a description of its integration into a specific application, and preliminary performance results in execution scenarios representing future workflow component patterns.

2 FAODEL DESIGN & ARCHITECTURE

The design of Faodel is largely driven by requirements derived from the operating environment of today's high-performance computing (HPC) platforms. As depicted in Figure 1, Faodel provides a means of connecting several different jobs that run concurrently on a platform. First, BSP or AMT parallel simulation jobs run and produce data objects that are either stored in internal resources or published to other resources. While coupling and workflow scenarios may use Faodel to pass data between components, this more complex use case has not been fully explored yet. Second, Faodel may use distributed memory and NVM to absorb bursts of data from the application or replay results from simulations to requesters. Third, in situ analysis & visualization (ISAV) tools use Faodel to retrieve and analyze data. Finally, coordination applications provide a means of helping the different jobs in the environment locate and connect with the resources of other jobs.

An examination of the application environment motivated three fundamental requirements for the design of Faodel:

• Faodel must provide basic primitives for users to reason about and decompose their datasets, but at the same time the API must be as agnostic as possible about how developers manage their data. Rather than force users to design algorithms around a data store's indexing and migration policies, it is better to provide mechanisms for users to express how the system should manage their data. By offering mechanisms to control data epoch visibility and a simple key/blob interface, Faodel offers an approach that can serve many kinds of clients.

• To aid scalability, separating application fates (i.e., the simulation from the analytics) also offers independent scalability through loose coupling. This requires using a communication layer that offers efficient data transfers between jobs while at the same time not breaking the communication libraries used within jobs (e.g., MPI). As such, Faodel cannot simply rely on sockets or splitting an MPI communicator and must instead use a low-level Remote DMA (RDMA) communication layer. This layer is an evolution of the long-proven NNTI layer from the Nessie RPC library [9, 13].

• Faodel must provide a way of migrating data objects from memory to higher-capacity resources, such as burst buffers or the parallel filesystem (PFS). This requirement implies Faodel must transition in-memory objects to systems with vendor-proprietary or file-based APIs.

Faodel is made up of several software components. We describe the higher-level components most relevant to application developers in the remainder of this section.

2.1 Kelpie

Kelpie provides a key/blob abstraction to facilitate flexible data exchange between different executables (e.g., a simulation application and applications for visualization and analysis). A key is a programmer-defined text string that allows the programmer to attach semantic significance to the associated data, a blob. Although a key attaches programmer-cognizable meaning (and possibly structure) to a blob, Kelpie is entirely ignorant of any meaning attached to a key or its associated blob. An example key might encode the application name, run number, iteration number, variable name, and some information about what part of a globally distributed array this blob represents. Separate processes can exchange data via Kelpie by exchanging key information. Key exchange can be explicit or implicit (i.e., keys can be constructed in a well-known way).

2.2 OpBox

OpBox is a library for implementing asynchronous communication between multiple entities in a distributed application. Our experiences with remote procedure call (RPC) libraries found that while RPCs provide a simple way to coordinate data transfers and invoke action at remote nodes, it is often difficult to coordinate more sophisticated data management services (e.g., ones involving more than two nodes, time-out conditions, or race conditions). Rather than leave the task of coordinating transfers entirely to the next layer up, OpBox provides the user with primitives for expressing a protocol as a state machine that the communication layer can process in an asynchronous manner. A communication pattern between a collection of nodes in OpBox is an Op. Users define and instantiate various Op classes as desired. Each provides a handle to the Op (implemented in C++ via future/promise), and a method to instruct OpBox to start running its encapsulated state machine. As each node in the Op communication pattern receives the Op, it processes the state machine accordingly.

OpBox provides a collection of operations that are common to many applications, including ping and counter operations. In addition, OpBox includes a Directory Manager Service that can be easily incorporated into an application. The Directory Manager stores node information in a hierarchical directory. A typical Faodel application would have at least one Directory Manager instance acting as a naming service to locate components of an application.

2.3 Lunasa

Lunasa provides user-level memory management services for network memory. For performance reasons, Faodel relies heavily on RDMA to transfer data throughout an HPC system. RDMA eliminates the need to copy user data objects to kernel buffers and allows data transfers to occur without CPU intervention. RDMA transfers require the user to register the memory buffers that are the source or destination of the transfer with the underlying network transport. Additionally, the virtual memory space that contains the memory buffer must be locked (or pinned) by the kernel to prevent it from being relocated in physical memory. The costs of registering and de-registering memory vary by network transport.



Figure 1: High-level diagram showing the relationship of Faodel to a workflow comprised of an AMT application and in situ analysis and visualization. The large colored shapes encompass the resources used by each entity (BSP MPI job, AMT job, data caching job, ISAV jobs, and platform resource manager job). The small squares represent computational resources (e.g., compute nodes, processors, or threads). The cylinders represent storage resources.

Lunasa amortizes these costs by facilitating the explicit reuse of registered memory. Existing approaches to reusing registered memory (see, e.g., [4, 8, 15]) do so implicitly. In other words, instead of de-registering memory buffers at the end of an RDMA operation, the memory is left registered and its registration information is cached. Subsequent requests to re-register the cached memory can be fulfilled without incurring the costs of registration.¹ As a result, application programmers are encouraged to manage their memory buffers to extract as much reuse as possible. Lunasa frees application programmers from the burden of managing these implicit reuse semantics and allows them to explicitly request memory allocations for use in RDMA operations. To accomplish this objective, Lunasa requests blocks of memory from the system and registers them at the outset. Lunasa then manages the resulting pool of registered memory as a standalone resource for satisfying user requests.

3 CASE STUDY: SIMPLEPIC

In this section we describe a sample application use case of Faodel services, enabling in situ analytics in a computational science code.

Particle in Cell (PIC) methods are a class of well-established computational techniques for simulating plasmas. Plasmas are comprised of charged particles in gaseous form interacting with each other and the surrounding environment. PIC methods simulate plasmas by computing: 1) electromagnetic (EM) fields created by charged particles in the plasma, 2) chemical reactions of plasma particles with the environment, and 3) motion of particles due to forces exerted by the EM fields. This last step generates currents and charges which further drive the EM field. Sandia is developing a new PIC code named EMPIRE that is being architected to scale to next-generation computing platforms. While EMPIRE is currently implemented as a BSP-style MPI code, developers are evaluating whether better load balancing can be achieved through a DARMA-based AMT implementation. In order to allow researchers to better explore AMT tradeoffs, the EMPIRE developers have constructed a reference application called SimplePIC that isolates the particle move phase. This phase is especially appealing for an AMT environment because the move routinely causes imbalances in the way the data is distributed across compute nodes.

¹ It is also worth noting that caching the registered memory eliminates the de-registration costs that would otherwise be incurred at the end of an RDMA operation.

Whether in a BSP or AMT setting, PIC simulations generate a substantial amount of data that can make analysis challenging for two reasons. First, the volume of data managed by a simulation is large enough that writing to disk and performing post processing is infeasible. This characteristic drives the need for ISAV tools that can summarize current conditions in a way that is meaningful for users. Second, efficient PIC applications periodically redistribute their datasets to achieve better load balancing as particles disperse over time in the simulation. While this load balancing is performed manually in the BSP case and automatically in the AMT case, the end result is that downstream ISAV applications need a mechanism for locating and retrieving particles. Faodel performs this function in both cases. Examples of ISAV applications include tasks such as quantifying how many particles are close to regions of interest in the mesh, identifying how many particles exceed a threshold velocity, and rendering images to help developers verify the simulation is modeling an environment correctly.

In order to provide a portable way for connecting PIC and ISAV applications through Faodel, we have constructed a Particle DIM (Data Interface Module) that is usable by both producers and consumers of PIC data. From the producer perspective, an application task periodically generates a patch of particle data that is injected into Faodel through the task's Particle DIM. The Particle DIM serializes the patch data into one or more key/blob pairs that are then published into Faodel's resources. The unique key names for patch data are also published as metadata in Faodel to provide a way for downstream applications to locate data. From the consumer's perspective, an application uses the Particle DIM to query metadata and retrieve relevant particle patches.

From the programmer's perspective, there are multiple advantages to using a DIM to interface with Faodel. First, a DIM establishes a contract between producers and consumers about how data is exchanged, but does not dictate how those transfers are implemented. This property allows DIM developers to write multiple implementations that decompose datasets in different ways if needed. Second, a DIM enables simulations to be decoupled from ISAV applications while retaining the ability to leave data objects in place if needed. The Particle DIM can be configured to store data objects in the application or in other distributed resources that are part of Faodel. Finally, a DIM provides a mechanism for implementing indexing that is right for the application. While the current Particle DIM performs basic indexing to locate items, it can easily be extended to allow consumers to make advanced queries. Our approach in this work is to have data producers compute statistics on data as it is inserted and store the information as additional metadata in Faodel.

4 EVALUATION

4.1 OpBox

One system that makes use of the Faodel infrastructure, and of OpBox in particular, is the custom metadata management system EMPRESS [7]. It allows clients to create custom attributes on parts of a variable, the whole variable, or even the whole data set to accelerate reading for data analysis. The two fundamental components of EMPRESS are a client API and a set of dedicated metadata servers. OpBox provides the connection between clients and servers. Each EMPRESS function has an underlying Op that passes the request and related metadata between the clients and servers. The performance of these Ops was tested using 100 write clients, 10 read clients, and one EMPRESS server. The data storage mechanism for the EMPRESS servers is currently a SQLite in-memory database. The evaluation results presented were obtained on the Chama capacity cluster at Sandia.

Testing indicates that each of the fundamental writing and reading Ops are efficiently supported, as indicated in Figures 2 and 3. All times are from the client perspective, include the round-trip communication and database operation times, and are the average of 5 experiments. In each experiment, each client performs one iteration of read or write operations. The operations demonstrated for writing are (0) "activating" an output set (committing a transaction), (1) the metadata associated with the entire run overall, (2) information for the current timestep, (3) the metadata for a single globally distributed variable, (4) creating a new, custom attribute type, (5) attaching an attribute to a run, (6) attaching an attribute to a timestep, and (7) attaching an attribute to part of a variable. For reading, the query returns (0) information about the run, (1) what timesteps are written for the run, (2) what variables are in a particular timestep, (3) what attributes have been attached to the run, (4) what attributes have been attached to the timestep, (5) what attributes have been attached to the variables, (6) attributes of a given type for a variable, and (7) attributes for a variable.

Figure 2: Writing operation total times on Chama. Operations: 0: Activate, 1: Run, 2: Timestep, 3: Var, 4: Type, 5: Run Attr, 6: Timestep Attr, 7: Var Attr.

Figure 3: Reading operation total times on Chama. Operations: 0: Runs, 1: Timesteps in Run, 2: Vars in Timestep, 3: Run Attrs, 4: Timestep Attrs, 5: Var Attrs, 6: Var Attrs of Type & Var, 7: Var Attrs of Var.

4.2 SimplePIC analytics

To demonstrate the feasibility of Faodel, we implemented a simple visualization example using SimplePIC and VTK [5]. In this example, particle data from SimplePIC are inserted into Faodel. The visualization application then retrieves data from Faodel and uses it to construct VTK objects for the purpose of generating a visual representation of the data.

The results of this experiment are shown in Figure 4. This preliminary evaluation is intended to demonstrate functionality and feasibility rather than performance. The figure shows the time required to move groups of particle data as a function of the number of particles being moved. The blue line shows the cost of inserting a block of particle data into Faodel. Because Faodel is designed to exchange data using shallow copies of reference-counted data structures, the time required to insert a block of particle data is independent of the number of particles for which data are being inserted (displayed as insignificant overhead in Figure 4).

The orange line in Figure 4 shows the cost of reading particle data from Faodel and generating VTK objects from its contents. In this case, the time required increases with the size of the data being manipulated. This is due to the fact that using particle data to create VTK objects currently requires a deep copy of the data read from Faodel.



Figure 4: Preliminary data demonstrating the costs of moving data into and out of Faodel (producer injection overhead and consumer ejection/to-VTK overhead vs. number of particles).

5 RELATED WORK

The data management services provided by Faodel can improve the efficiency, speed, and maintainability of key computational science applications. An important body of existing research has examined many similar services.

5.1 Asynchronous Many-Task Runtime Systems

Asynchronous Many-Task (AMT) runtime systems promise to eliminate many of the scaling issues that have emerged in tightly coupled, synchronous applications by allowing independent tasks to execute as soon as their data dependencies are satisfied. Because predicting where and when a task may execute in an AMT programming model is a formidable task, AMT runtimes depend on services to provide efficient data movement. Meng et al. [11] describe how the Uintah "Data Warehouse" allows tasks to access Uintah variables. Each variable is stored in a dictionary with a well-defined key. Tasks are only allowed to work with variables stored in their local memory and variables acquired from the Data Warehouse. Faodel is aimed at providing job-to-job communication in these environments.

5.2 Data Management in Scientific Workflows

Many modern scientific simulations require multiple phases of execution to arrive at a solution. Efficient execution of these simulations depends on robust data management services for exchanging (or sharing) simulation data among the applications that comprise the workflow. Dataspaces [2] uses distributed hash tables and leverages advanced interconnect services (e.g., RDMA) to exchange data between applications (e.g., coupled physics codes). DDSS [16] is a distributed service for sharing data in a datacenter environment. The key abstraction in this approach is a logical shared memory region referenced by a unique key. Sharing data is accomplished by sharing the key associated with the data. Support for coherence and access control (e.g., granting exclusive access) are provided. Other research projects, e.g., Kepler [10] and Taverna [12], have focused on developing tools that enable domain scientists to quickly and easily assemble scientific workflows. Faodel focuses on delivering similar capabilities to applications using other programming models such as AMT.

5.3 Data Analysis

Many scientific simulations have the potential to generate vast quantities of output data. Domain scientists rely on sophisticated analysis and visualization to make sense of these data. Efficient use of these tools requires robust data management services to find and access output datasets. Pavlo et al. [14] compare the use of MapReduce and Parallel Database Management Systems (DBMS) for analyzing large volumes of data. For both of these approaches, data is stored and exchanged through the filesystem. SENSEI [1] defines a generic data model to facilitate the transfer of data between simulation and analysis tasks. Their generic data model is intended to simplify the process of combining a simulation code with different kinds of analysis.

5.4 Unified Storage Hierarchy

HPC systems have recently experienced significant growth in the depth of their storage hierarchy. SSDs, NVRAM, and 3D-stacked DRAM are all becoming increasingly common. Because each level of the hierarchy represents a different set of tradeoffs (and possibly also different programming interfaces), application developers face an increasingly complex set of choices when deciding where their data should reside. UNITY [6] provides a single interface for applications to access all levels of the storage hierarchy. Data Elevator [3] provides a transparent mechanism for moving data among different layers of the storage hierarchy. Specifically, the authors describe and demonstrate their approach to moving data between burst buffers and the parallel file system.

6 CONCLUSION

This paper describes our work to date on Faodel, a set of services we envision as a data backplane for applications in computational science workflows. Using Faodel, applications in workflows will be able to exchange data without necessarily involving a file system, while structuring those exchanges according to semantics that are meaningful to the developers involved. We have demonstrated the generality and efficiency of the Faodel software stack in several application use cases, two of which we have presented here. Our initial evaluations have been intended to demonstrate feasibility for these examples. Our future plans for Faodel include evaluations in more complex workflows with larger computational and data transfer requirements. We are also investigating integrations with workflow definition and deployment tools under development by other researchers at Sandia.

ACKNOWLEDGMENTS

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC, a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA0003525.

REFERENCES

[1] Utkarsh Ayachit, Brad Whitlock, Matthew Wolf, Burlen Loring, Berk Geveci, David Lonie, and E. Wes Bethel. 2016. The SENSEI generic in situ interface. In In Situ Infrastructures for Enabling Extreme-Scale Analysis and Visualization (ISAV), Workshop on. IEEE, 40-44.

[2] Ciprian Docan, Manish Parashar, and Scott Klasky. 2010. DataSpaces: an interaction and coordination framework for coupled simulation workflows. In Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing. ACM, 25-36.

[3] Bin Dong, Suren Byna, Kesheng Wu, Hans Johansen, Jeffrey N. Johnson, Noel Keen, et al. 2016. Data Elevator: Low-Contention Data Movement in Hierarchical Storage System. In High Performance Computing (HiPC), 2016 IEEE 23rd International Conference on. IEEE, 152-161.

[4] Richard L. Graham, Timothy S. Woodall, and Jeffrey M. Squyres. 2005. Open MPI: A flexible high performance MPI. In International Conference on Parallel Processing and Applied Mathematics. Springer, 228-239.

[5] Kitware Inc. 2018. VTK - The Visualization Toolkit. http://www.vtk.org. (2018). Accessed: 2018-04-03.

[6] Terry Jones, Michael J. Brim, Geoffroy Vallee, Benjamin Mayer, Aaron Welch, Tonglin Li, Michael Lang, Latchesar Ionkov, Douglas Otstott, Ada Gavrilovska, et al. 2017. UNITY: Unified Memory and File Space. In Proceedings of the 7th International Workshop on Runtime and Operating Systems for Supercomputers (ROSS 2017). ACM, 6.

[7] Margaret Lawson, Craig Ulmer, Shyamali Mukherjee, Gary Templet, Jay Lofstead, Scott Levy, Patrick Widener, and Todd Kordenbrock. 2017. Empress: Extensible Metadata Provider for Extreme-scale Scientific Simulations. In Proceedings of the 2nd Joint International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS '17). ACM, New York, NY, USA, 19-24. https://doi.org/10.1145/3149393.3149403

[8] Jiuxing Liu, Jiesheng Wu, and Dhabaleswar K. Panda. 2004. High performance RDMA-based MPI implementation over InfiniBand. International Journal of Parallel Programming 32, 3 (2004), 167-198.

[9] Jay Lofstead, Ron Oldfield, Todd Kordenbrock, and Charles Reiss. 2011. Extending Scalability of Collective I/O Through Nessie and Staging. In Proceedings of the 6th Parallel Data Storage Workshop (PDSW '11). Seattle, WA, 7-12. https://doi.org/10.1145/2159352.2159355

[10] Bertram Ludäscher, Ilkay Altintas, Chad Berkley, Dan Higgins, Efrat Jaeger, Matthew Jones, Edward A. Lee, Jing Tao, and Yang Zhao. 2006. Scientific workflow management and the Kepler system. Concurrency and Computation: Practice and Experience 18, 10 (2006), 1039-1065.

[11] Qingyu Meng, Justin Luitjens, and Martin Berzins. 2010. Dynamic task scheduling for the Uintah framework. In Many-Task Computing on Grids and Supercomputers (MTAGS), 2010 IEEE Workshop on. IEEE, 1-10.

[12] Tom Oinn, Matthew Addis, Justin Ferris, Darren Marvin, Martin Senger, Mark Greenwood, Tim Carver, Kevin Glover, Matthew R. Pocock, Anil Wipat, et al. 2004. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 20, 17 (2004), 3045-3054.

[13] Ron A. Oldfield, Patrick Widener, Arthur B. Maccabe, Lee Ward, and Todd Kordenbrock. 2006. Efficient Data-Movement for Lightweight I/O. In Proceedings of the 2006 International Workshop on High Performance I/O Techniques and Deployment of Very Large Scale I/O Systems. Barcelona, Spain. https://doi.org/10.1109/CLUSTR.2006.311897

[14] Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. DeWitt, Samuel Madden, and Michael Stonebraker. 2009. A comparison of approaches to large-scale data analysis. In Proceedings of the 2009 ACM SIGMOD International Conference on Management of Data. ACM, 165-178.

[15] Hiroshi Tezuka, Francis O'Carroll, Atsushi Hori, and Yutaka Ishikawa. 1998. Pin-down cache: A virtual memory management technique for zero-copy communication. In Proceedings of the First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing (IPPS/SPDP) 1998. IEEE, 308-314.

[16] Karthikeyan Vaidyanathan, Sundeep Narravula, and Dhabaleswar K. Panda. 2006. DDSS: a low-overhead distributed data sharing substrate for cluster-based data-centers over modern interconnects. Lecture Notes in Computer Science 4297 (2006), 472.