
Cluster Comput (2010) 13: 315–333. DOI 10.1007/s10586-010-0133-8

Parameterized specification, configuration and execution of data-intensive scientific workflows

Vijay S. Kumar · Tahsin Kurc · Varun Ratnakar · Jihie Kim · Gaurang Mehta · Karan Vahi · Yoonju Lee Nelson · P. Sadayappan · Ewa Deelman · Yolanda Gil · Mary Hall · Joel Saltz

Received: 7 November 2009 / Accepted: 16 March 2010 / Published online: 10 April 2010
© Springer Science+Business Media, LLC 2010

Abstract Data analysis processes in scientific applications can be expressed as coarse-grain workflows of complex data processing operations with data flow dependencies between them. Performance optimization of these workflows can be viewed as a search for a set of optimal values in a multidimensional parameter space consisting of input performance parameters to the applications that are known to affect their execution times. While some performance parameters such as grouping of workflow components and their mapping to machines do not affect the accuracy of the analysis, others may dictate trading the output quality of individual components (and of the whole workflow) for performance. This paper describes an integrated framework which is capable of supporting performance optimizations along multiple such parameters. Using two real-world applications in the spatial, multidimensional data analysis domain, we present an experimental evaluation of the proposed framework.

Keywords Scientific workflow · Performance parameters · Semantic representations · Grid · Application QoS

V.S. Kumar (✉) · P. Sadayappan
Dept. of Computer Science and Engineering, Ohio State University, Columbus, OH 43210, USA
e-mail: [email protected]

V. Ratnakar · J. Kim · G. Mehta · K. Vahi · Y.L. Nelson · E. Deelman · Y. Gil
Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292, USA

M. Hall
School of Computing, University of Utah, Salt Lake City, UT 84112, USA

T. Kurc · J. Saltz
Center for Comprehensive Informatics, Emory University, Atlanta, GA 30322, USA

1 Introduction

Advances in digital sensor technology and the complex numerical models of physical processes in many scientific domains are bringing about the acquisition of enormous volumes of data. For example, a dataset of high resolution image data obtained from digital microscopes or large scale sky telescopes can easily reach hundreds of Gigabytes, even multiple Terabytes in size.1 These large data volumes are transformed into meaningful information via data analysis processes. Analysis processes in scientific applications are expressed in the form of workflows or networks of interdependent components, where each component corresponds to an application-specific data processing operation. Image datasets, for instance, are analyzed by applying workflows consisting of filtering, data correction, segmentation, and classification steps. Due to the data and compute intensive nature of scientific data analysis applications, scalable solutions are required to achieve desirable performance. Software systems supporting the analysis of large scientific datasets implement several optimization mechanisms to reduce execution times. First, workflow management systems take advantage of distributed computing resources in the Grid. The Grid environment provides computation and storage resources; however, these resources are often located at disparate sites managed within different security and administrative domains. Workflow systems support execution of workflow components at different sites (Grid nodes) and reliable, efficient staging of data across the Grid nodes. A Grid node may itself be a large cluster system or a potentially heterogeneous and dynamic collection of machines. Second, for each portion of a workflow mapped to such clusters, they enable the fine-grain mapping and scheduling of tasks onto such machines.

1 DMetrix array microscopes can scan a slide at 20x+ resolution in less than a minute. The Large Synoptic Survey Telescope will be able to capture a 3.2 Gigapixel image every 6 seconds, when it is activated.

The performance of workflows is greatly affected by certain parameters to the application that direct the amount of work to be performed on a node or the volume of data to be processed at a time. The optimal values of such parameters can be highly dependent on the execution context. Therefore, performance optimization for workflows can be viewed as a search for a set of optimal values in a multidimensional parameter space, given a particular execution context. Workflow-level performance parameters include grouping of data processing components comprising the workflow into 'meta-components', distribution of components across sites and machines within a site, and the number of instances of a component to be executed. These parameters impact computation, I/O, and communication overheads, and as a result, the total execution time. Another means of improving performance is by adjusting component-level performance parameters in a workflow. An example of such a parameter is the data chunk size in applications which analyze spatial, multidimensional datasets. Another example is the version of the algorithm employed by a component to process the data.

We classify workflow-level and component-level performance parameters into two categories:

(i) Accuracy-preserving parameters (such as data chunk size) can affect the performance of an operation without affecting the quality of the analysis output, and (ii) Accuracy-trading parameters can trade the quality of the output for performance gains, and vice-versa. An example of an accuracy-trading parameter is the 'resolution' at which image data are processed. A classification algorithm might process low-resolution images quickly, but its classification accuracy would likely be lower compared to that for higher resolution images. When optimizations involve accuracy-performance trade-offs, users may supplement their queries with application-level quality-of-service (QoS) requirements that place constraints on the accuracy of the analysis [17]. For instance, when images in a dataset are processed at different resolutions to speed up the classification process, the user may request that a certain minimum accuracy threshold be achieved.

In this paper, we describe the design and implementation of a framework that can support application execution in a distributed environment and enable performance optimization via manipulation of accuracy-preserving and/or accuracy-trading parameters. We present an instance of our framework that integrates multiple subsystems at different levels:

– Wings [12] to facilitate high-level, semantic descriptions of application workflows

– Pegasus [11], Condor [23], and DataCutter [2] to support scalable workflow execution across multiple institutions and on distributed clusters within an institution

– ECO [5] to enable compiler optimizations for fine-grain computations executing on specific resources

Application developers and end-users can use our framework to provide high-level, semantic descriptions of application structure and data characteristics. As our initial focus is on addressing performance requirements in spatial, multidimensional data analysis applications, we have developed extensions to core ontologies in Wings to be able to describe spatial datasets and also to enable automatic composition and validation of the corresponding workflows. Once a workflow has been specified, users can adjust workflow-level and component-level parameters based on their QoS requirements to enable performance optimizations during execution. As part of this effort, we have also extended Condor's default job-scheduling mechanism to support performance optimizations stemming from accuracy-performance related trade-offs. We show how our framework supports parameter-based optimizations for real biomedical image analysis workflows using two cluster systems located at two different departments at the Ohio State University.

2 Related work

Workflow management systems for the Grid and Services Oriented Architectures, such as Taverna [22], Kepler [19] and Pegasus [11], seek to minimize the makespan by manipulating workflow-level parameters such as grouping and mapping of a workflow's components. Our framework extends such support by providing the combined use of task- and data-parallelism and data streaming within each component and across multiple components in a workflow to fully exploit the capabilities of Grid sites that are high-end cluster systems. Glatard et al. [13] describe the combined use of data parallelism, services parallelism and job grouping for data-intensive application service-based workflows. Our work is in the context of task-based workflows where execution plans are developed based on abstract workflow descriptions. We also address performance improvements by adjusting domain-specific component-level parameters.

The Common Component Architecture (CCA) forum2 addresses domain-specific parameters for components and the efficient coupling of parallel scientific components. They seek to support performance improvements through the use of external tunability interfaces [4, 21]. The Active Harmony system [8, 9] is an automatic parameter tuning system that permits on-line rewriting of parameter values at run-time, and uses a simplex method to guide the search for optimal parameter values. Although we share similar motivations with the above works, we target data-intensive applications running on the Grid. We account for performance variations brought about by the characteristics of dataset instances within a domain.

2 http://www.cca-forum.org.

Our work also supports application-level QoS requirements by tuning accuracy-trading parameters in the workflows. The performance/quality trade-off problem and the tuning of quality-trading parameters for workflows has been examined before in service-based workflows [3] and component-based systems [10]. But these works are geared towards system-level QoS and optimization of system-related metrics such as data transfer rates, throughput, and service affinity. Application-level QoS for workflows has been addressed in [6, 26]. Acher et al. [1] describe a services-oriented architecture that uses ontology-based semantic modeling techniques to address the variability in results produced by Grid services for medical imaging. We support trade-offs based on quality of data output from the application components and integrate such parameter tuning within a batch system like Condor.

Supporting domain-specific parameter-based optimizations requires the representation of these parameters and their relations with various performance and quality metrics in a system-comprehensible manner. In [6], end-users are required to provide performance and quality models of expected application behavior to the system. Ontological representations of performance models have been investigated in the context of workflow composition in the Askalon system [24]. Lera et al. [18] proposed the idea of developing performance-related ontologies that can be queried and reasoned upon to analyze and improve performance of intelligent systems. Zhou et al. [26] used rule-based systems to configure component-level parameters. While beyond the scope of this paper, we seek to complement our framework with such approaches in the near future.

3 Motivating applications

Our work is motivated by the requirements of applications that process large spatial, multidimensional datasets. We use the following two application scenarios from the biomedical image analysis domain in our evaluation. Each application has different characteristics and end-user requirements.

3.1 Application 1: Pixel Intensity Quantification (PIQ)

Figure 1 shows a data analysis pipeline [7] (developed by neuroscientists at the National Center for Microscopy and Imaging Research) for images obtained from confocal microscopes. This analysis pipeline quantifies pixel intensity within user-specified polygonal query regions of the images through a series of data correction steps as well as thresholding, tessellation, and prefix sum generation operations. This workflow is employed in studies that involve comparison of image regions obtained from different subjects as mapped to a canonical atlas (e.g., a brain atlas). From a computational point of view, the main end-user requirements are (1) to minimize the execution time of the workflow while preserving the highest output quality, and (2) to support the execution of potentially terabyte-sized out-of-core image data.

3.2 Application 2: Neuroblastoma Classification (NC)

Figure 2 shows a multi-resolution based tumor prognosis pipeline [14] (developed by researchers at the Ohio State University) applied to images from high-power light microscopy scanners. This workflow is employed to classify image data into grades of neuroblastoma, a common childhood cancer. Our primary goal is to optimally support user queries while simultaneously meeting a wide range of application-level QoS requirements. Examples of such queries include: "Minimize the time taken to classify image regions with 60% accuracy" or "Determine the most accurate classification of an image region within 30 minutes, with greater importance attached to feature-rich regions".

Fig. 1 Pixel Intensity Quantification (PIQ) workflow


Fig. 2 Neuroblastoma Classification (NC) workflow

Here, accuracy of classification and richness of features are application domain-specific concepts and depend on the resolution at which the image is processed. In an earlier work, we developed heuristics that exploit the multi-resolution processing capability and the inherent spatial locality of the image data features in order to provide improved responses to such queries [17].

The aforementioned applications differ in their workflow structure and also the complexity of their data analysis operations. The NC workflow processes a portion or chunk of a single image at a time using a sequence of operations. The end-result for an image is an aggregate of results obtained from each independently processed chunk. PIQ, on the other hand, contains complex analysis operations such as aggregations, global computations and joins across multiple datasets, that are not easily scalable. Hence, such applications require parallel algorithms for efficient processing of out-of-core data.

4 Performance optimizations

In this section we discuss several strategies for improving workflow performance. Drawing from the application scenarios, we also present parameters that impact the performance of applications in the spatial, multidimensional data analysis domain. We classify these parameters into two main categories, accuracy-preserving parameters and accuracy-trading parameters, and explain how they can influence performance and/or quality of output.

4.1 Accuracy-preserving parameters

Chunking strategy Individual data elements (e.g. images) in a scientific dataset may be larger than the physical memory available on current workstations. Relying on virtual memory alone is likely to yield poor performance. In general, the processing of large, out-of-core spatial, multidimensional data is supported by partitioning it into a set of data chunks, each of which can fit in memory, and by modifying the analysis routines to operate on chunks of data at a time. Here, a data chunk provides a higher-level abstraction for data distribution and is the unit of disk I/O. In the simplest case, we can have a uniform chunking strategy, i.e. all chunks have the same shape and size. For 2-D data, this parameter is represented by a pair [W, H], where W and H respectively represent the width and height of a chunk. In our work, we use this simplified strategy and refer to this parameter as chunksize. The chunksize parameter can influence application execution in several ways. The larger a chunk is, the greater the amount of disk I/O and inter-processor communication for that chunk will likely be, albeit the number of chunks will be smaller. The chunksize affects the number of disk blocks accessed and network packets transmitted during analysis. However, larger chunks imply a decreased number of such chunks to be processed, and this could, in turn, decrease the job scheduling overheads. Moreover, depending on the levels of memory hierarchy and hardware architecture present on a compute resource, the chunksize can affect the number of cache hits/misses for each component and thus, the overall execution time. For the PIQ workflow, we observed that varying chunksize resulted in differences in execution time. Moreover, the optimal chunksize for one component may not be optimal for other components; some components prefer larger chunks, some prefer square-shaped chunks over thin-striped chunks, while others may function independent of the chunksize.
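As a concrete illustration of the uniform chunking strategy, the sketch below (illustrative Python, not the framework's implementation; the helper name make_chunks is hypothetical) partitions a 2-D data element of a given width and height into chunks of a chosen chunksize [W, H], clipping chunks that fall on the right and bottom boundaries.

    # Minimal sketch of uniform 2-D chunking; chunk_w and chunk_h together play
    # the role of the chunksize parameter [W, H] described above.
    def make_chunks(image_w, image_h, chunk_w, chunk_h):
        chunks = []
        for y in range(0, image_h, chunk_h):
            for x in range(0, image_w, chunk_w):
                # clip chunks on the right/bottom edges to the image boundary
                w = min(chunk_w, image_w - x)
                h = min(chunk_h, image_h - y)
                chunks.append((x, y, w, h))
        return chunks

    # Example: a 4096 x 4096 plane with chunksize = 1024 x 1024 yields 16 chunks.
    print(len(make_chunks(4096, 4096, 1024, 1024)))  # -> 16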

Component configuration Components in a workflow could have many algorithmic variants. Traditionally, one of these variants is chosen based on the type of the input data to the component and/or the type of the output data expected from the component. However, choosing an algorithmic variant can affect the application performance based on resource conditions/availability and data characteristics, even when each variant performs the analysis differently but produces the same output and preserves the output quality. In an earlier work [15], we developed three parallel-algorithmic variants for the warp component of the PIQ workflow. We observed that depending on the available resources—cluster nodes with faster processors or clusters equipped with high-speed interconnects or nodes offering higher disk I/O rates—each variant was capable of outperforming the others, and no single variant performed best under all resource conditions.

Task granularity and execution strategy A workflow may consist of many components. If a chunk-based processing of datasets is employed, creating multiple copies of a component instance may speed up the process through data parallelism. How components and component copies are scheduled, grouped, and mapped to machines in the environment will affect the performance, in particular if the environment consists of a heterogeneous collection of computational resources. An approach could be to treat each (component instance, chunk) pair as a task and each machine in the environment as a Grid site. This provides a uniform mechanism for execution within a Grid site as well as across Grid sites. It also enables maximum flexibility in using job scheduling systems such as Condor [23]. However, if the number of component instances and chunks is large, then the scheduling and data staging overheads may assume significance. An alternative strategy is to group multiple components into meta-components and map/schedule these meta-components to groups of machines. Once a meta-component is mapped to a group of machines, a combined task- and data-parallelism approach with pipelined dataflow style execution can be adopted within the meta-component. When chunking is employed, the processing of chunks by successive components (such as thresholding and tessellation in the PIQ workflow) can be pipelined such that, when a component C1 is processing a chunk i, then the downstream component C2 can concurrently operate on chunk i + 1 of the same data element. A natural extension to pipelining is the ability to stream data between successive workflow components mapped to a single Grid site, so that the intermediate disk I/O overheads are avoided.
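The streaming style of execution within a meta-component can be pictured with the following sketch (an illustrative Python analogy with hypothetical threshold and tessellate stages, not the DataCutter-based implementation): two successive chunk-level operations are chained so that each chunk flows through both stages without intermediate results being written to disk. True concurrent processing of chunk i and chunk i + 1 would additionally place each stage in its own thread or process.

    # Illustrative sketch: streaming chunks through two grouped components.
    def threshold(chunks):
        for chunk in chunks:
            yield [[1 if v > 128 else 0 for v in row] for row in chunk]

    def tessellate(chunks):
        for chunk in chunks:
            # stand-in for a real tessellation step: count foreground pixels
            yield sum(sum(row) for row in chunk)

    def run_meta_component(chunks):
        # threshold + tessellate grouped into one "meta-component"
        return list(tessellate(threshold(chunks)))

    print(run_meta_component([[[200, 90], [130, 40]]]))  # -> [2]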

4.2 Accuracy-trading parameters

Chunking strategy The chunksize parameter may also function as an accuracy-trading performance parameter in cases where analysis operations depend on the feature content within a data chunk. In the NC workflow, chunksize affects both the performance and accuracy of the analysis. The smaller the chunksize is, the faster the analysis of each chunk. However, small chunksize values (where chunks are smaller than the features being analyzed) could lead to false-negatives. If the chunk is too large, then extraneous features and noise within a chunk could impact the accuracy of analysis.

Data resolution Spatial data has an inherent notion of data quality. The resolution parameter for spatial, multidimensional data takes values from 1 to n, where n represents the data at its highest quality. If a multi-resolution analysis approach is adopted, then the processing of data chunks at different resolutions will produce output of varying quality. Execution time increases with resolution because higher resolutions contain more data to be processed. In some cases, data may need to be processed only at the lowest target resolution that can yield a result of adequate quality. It is assumed that the accuracy of analysis is maximum when data is processed at its highest resolution.
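For example, a multi-resolution strategy of this kind might be sketched as follows (hypothetical analyze and estimate_quality routines, for illustration only): a chunk is processed at increasing resolution levels and processing stops at the lowest resolution whose estimated quality meets the requested threshold.

    # Illustrative sketch: stop at the lowest resolution of adequate quality.
    def process_at_adequate_resolution(chunk, analyze, estimate_quality,
                                       max_resolution, min_quality):
        for resolution in range(1, max_resolution + 1):
            result = analyze(chunk, resolution)
            if estimate_quality(result) >= min_quality:
                return result, resolution      # adequate quality reached early
        return result, max_resolution          # fall back to the highest resolution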

Processing order The processing order parameter refers to the order in which data chunks are operated upon by the components in a workflow. In the NC workflow, the accuracy of analysis computed across an entire image is obtained as an aggregate of the accuracy of analysis for each chunk in the image. As aggregation is not always commutative, the order in which chunks are processed will affect the accuracy of analysis when computed across the entire image. Our previous work [17] with the NC workflow showed how out-of-order processing of chunks (i.e., selecting a subset of "favorable" chunks ahead of other chunks) in an image could be used to improve response (by up to 40%) to user queries that contain various quality-of-service requirements.
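A minimal sketch of such out-of-order processing is shown below (hypothetical scoring and processing functions, not the heuristics of [17]): chunks are ranked by an application-defined favorability score and processed in that order until the time budget is exhausted, so that feature-rich chunks contribute to the aggregate accuracy as early as possible.

    import heapq

    # Illustrative sketch: process "favorable" chunks first. score() is a
    # hypothetical estimate of how much a chunk is expected to contribute
    # to the aggregate accuracy; deadline_remaining() returns seconds left.
    def process_in_favorability_order(chunks, score, process, deadline_remaining):
        heap = [(-score(c), i, c) for i, c in enumerate(chunks)]
        heapq.heapify(heap)
        results = []
        while heap and deadline_remaining() > 0:
            _, _, chunk = heapq.heappop(heap)
            results.append(process(chunk))
        return results  # aggregate over the chunks processed within the deadline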

5 Workflow composition and execution framework

In this section, we describe our framework to support specification and execution of data analysis workflows. The framework consists of three main modules. The description module implements support for high-level specification of workflows. In this module, the application structure and data characteristics for the application domain are presented to the system. This representation is independent of actual data instances used in the application and the compute resources on which the execution is eventually carried out. The execution module is responsible for workflow execution and takes as input the high-level description of the workflow produced by the description module, the datasets to be analyzed, and the target distributed execution environment. Lastly, the trade-off module implements runtime mechanisms to enable accuracy-performance trade-offs based on user-specified quality-of-service requirements and constraints. The architecture of the framework is illustrated in Fig. 3. In this paper, we evaluated a specific instance of this framework where the representative systems for each module are highlighted using blue boxes in the figure.

Fig. 3 Framework architecture

5.1 Description module (DM)

The Description Module (DM) is implemented using the Wings (Workflow Instance Generation and Selection) system [12]. In the Wings representation of a scientific workflow, the building blocks of a workflow are components and data types. Application domain-specific components are described in component libraries. A component library specifies the input and output data types of each component and how metadata properties associated with the input data types relate to those associated with the output for each component. The data types themselves are defined in a domain-specific data ontology. Wings allows users to describe an application workflow using semantic metadata properties associated with workflow components and data types at a high level of abstraction. This abstraction is known as a workflow template. The workflow template and the semantic properties of components and data types are expressed using the Web Ontology Language (OWL).3 A template effectively specifies the application-specific workflow components, how these components are connected to each other to form the workflow graph, and the type of data exchanged between the components. Figure 4(a) is a depiction of a workflow template constructed for the PIQ application. The workflow template is data-instance independent; it specifies data types consumed and produced by the components but not particular datasets. Wings can use the semantic descriptions to automatically validate a given workflow, i.e., if two components are connected in the workflow template, Wings can check whether output data types and properties of the first component are consistent with the input data types and properties of the second component. Given a workflow template, the user can specify a data instance (e.g., an image) as input to the workflow and the input argument values to each component in the workflow. Using the metadata properties of the input datasets, Wings can automatically generate a detailed specification of the workflow tasks and data flow in the form of a DAG referred to as an expanded workflow instance for execution.

3 http://www.w3.org/TR/owl-ref.


Fig. 4 PIQ application workflow represented using Wings

Figure 4(b) shows a workflow instance generated by Wings from the PIQ workflow template.

Domain-specific data ontology Wings provides "core" ontologies that can describe abstract components and data types. For any new application domain, these core ontologies can be extended to capture domain-specific information. Semantic descriptions for new data types and components are maintained in domain-specific ontologies. Workflow templates for applications within that domain can then be constructed using these ontologies. The core data ontology in Wings contains OWL-based descriptions of generic data concepts such as File, Collection of Files of the same type, and CollOfCollections, as well as their metadata properties. However, our motivating applications process spatial, multidimensional data by partitioning datasets into higher-level entities called chunks. The core data ontology is not expressive enough to capture the semantics of datasets used in our motivating applications. So, we extended this core ontology (as shown in Fig. 5) to express concepts and properties in the spatial, multidimensional data model so that application data flow can be conveniently described in terms of these concepts. While this ontology is not exhaustive, it is generic enough to represent typical application instances such as the PIQ and NC workflows. It can also be extended to support a wider range of data analysis applications within the domain.

Fig. 5 Wings data ontology extensions for spatial, multidimensional data
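The kind of information these extensions capture can be pictured with a small data-model sketch (plain Python rather than OWL; the class and property names are hypothetical and chosen for illustration): a chunk carries the spatial metadata, such as its parent image, coordinates, extent, plane, and resolution, that later allows the framework to reason about workflow structure and scheduling.

    from dataclasses import dataclass

    # Simplified, illustrative mirror of the ontology concepts; the actual
    # extensions are expressed in OWL within Wings.
    @dataclass
    class Image:
        image_id: str
        width: int
        height: int
        num_planes: int

    @dataclass
    class Chunk:
        parent: Image
        x: int           # spatial offset of the chunk within its plane
        y: int
        width: int
        height: int
        plane: int
        resolution: int  # 1 .. n, where n is the highest quality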

Wings also provides compact representations in the template for component collections, i.e., cases where multiple copies of a component can be instantiated to perform the same analysis on different data input instances. The number of such copies for a component can be selected during the workflow instance generation stage based on the properties of the input datasets. Performance can be improved by adjusting both the task- and data-parallelism parameter as well as application performance parameters such as chunksize. For the PIQ and NC workflows, the chunksize parameter dictates the unrolling of component collections in the workflow template into a bag of component tasks in the workflow instance. As an example, assume that the value of the chunksize parameter chosen for a given input image was 2560 × 2400 pixels and resulted in each image slice being partitioned into 36 chunks. The corresponding expanded workflow instance for the PIQ workflow as shown in Fig. 4(b) contains 116 tasks. Component collections are shown in the workflow instance as long horizontal rows of tasks, signifying that each collection has been unrolled into 36 tasks, where each task represents an operation performed on a single chunk. Thus, the chunksize parameter influences the structure of the resulting workflow instance.
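The relation between chunksize and the size of the expanded workflow can be made concrete with a short calculation (an illustrative sketch that assumes a PIQ template with three component collections and eight singleton components, a breakdown inferred from the task counts reported later in Table 1 rather than stated explicitly in the text):

    import math

    def chunks_per_plane(plane_w, plane_h, chunk_w, chunk_h):
        return math.ceil(plane_w / chunk_w) * math.ceil(plane_h / chunk_h)

    def tasks_in_workflow_instance(num_chunks, num_collections=3, num_singletons=8):
        # each component collection unrolls into one task per chunk;
        # each singleton component contributes a single task
        return num_collections * num_chunks + num_singletons

    # A 15,360 x 14,400 plane with chunksize = 2560 x 2400 gives 36 chunks,
    # and the expanded PIQ instance then contains 3*36 + 8 = 116 tasks.
    print(chunks_per_plane(15360, 14400, 2560, 2400))   # -> 36
    print(tasks_in_workflow_instance(36))               # -> 116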

In our current system, we also support the notion of meta-components or explicit component grouping. A meta-component can be viewed as a combination or clustering of components across multiple levels in a workflow template. Coalescing components into meta-components corresponds to the adjustment of the task granularity parameter for performance optimization. During execution, all tasks within a meta-component are scheduled at once and mapped to the same set of resources. Note that the use of meta-components complements the existing job clustering capabilities provided by Grid workflow systems such as Pegasus [11]. Horizontal and vertical job clustering capabilities in Pegasus are used to group tasks that are explicitly specified in a workflow instance. However, it is highly desirable to also support the following features:

– Task parallelism: Vertical clustering implies that workflow tasks at multiple levels are grouped into clusters so that each group can be scheduled for execution on the same processor. To improve performance, task parallelism across tasks within such groups, and also data streaming from one task to another, is needed to avoid disk I/O overheads.

– Fine-grain control: Workflows such as PIQ also contain parallel components (e.g., MPI-style jobs) that execute across a number of nodes, but whose processes are not expressed as explicit tasks in the workflow instance. It is the responsibility of the parallel runtime system at a site to manage the processes of such components across the nodes at that site.

Meta-components can provide such task-parallelism and fine-grain control. By grouping successive components into a meta-component in the workflow template, each such component's processes can be concurrently scheduled for execution on the same compute resource. For example, the grey rectangles shown in the workflow instance in Fig. 6 indicate that all processes corresponding to all four components (shown as blue ovals) in the workflow have been fused into a meta-component. One task corresponding to each meta-component will be scheduled for execution concurrently. Within each meta-component task, the execution is pipelined, and data is streamed from one component to another during execution.


Fig. 6 Meta-components in the NC workflow instance

5.2 Execution module (EM)

The Execution Module (EM) consists of the following subsystems that work in an integrated fashion to execute workflows in a distributed environment and on cluster systems:

Pegasus workflow management system is used to reliably map and execute application workflows onto diverse computing resources in the Grid [11]. Pegasus takes resource-independent workflow descriptions generated by the DM and produces concrete workflow instances with additional directives for efficient data transfers between Grid sites. Portions of workflow instances are mapped onto different sites, where each site could potentially be a heterogeneous, cluster-style computing resource. Pegasus is used to manipulate the component config parameter, i.e. the component transformation catalog can be modified to select an appropriate mapping from components to analysis codes for a given workflow instance. Pegasus also supports runtime job clustering to reduce scheduling overheads. Horizontal clustering groups together tasks at the same level of the workflow (e.g. the unrolled tasks from a component collection), while vertical clustering can group serial tasks from successive components. All tasks within a group are scheduled for execution on the same set of resources. However, Pegasus does not currently support pipelined dataflow execution and data streaming between components. This support is provided by DataCutter [2] as explained later in this section.
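Horizontal clustering can be pictured with a short sketch (illustrative only; Pegasus performs this grouping internally): the tasks unrolled from one component collection are grouped into bundles of a fixed size, so the batch scheduler sees one job per bundle rather than one job per chunk.

    # Illustrative sketch of horizontal clustering with a bundle size of 32,
    # the setting used in the experiments of Sect. 6.
    def horizontal_clusters(tasks, bundle_size=32):
        return [tasks[i:i + bundle_size] for i in range(0, len(tasks), bundle_size)]

    # 900 chunk-level tasks from one collection -> 29 clustered jobs
    print(len(horizontal_clusters(list(range(900)))))   # -> 29

Assuming three such collections plus eight singleton tasks in the PIQ instance (an inference from Table 1), this grouping is consistent with the 95 clustered tasks reported there for 900 chunks.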

Condor [23] is used to schedule tasks across machines. Pegasus submits tasks corresponding to a portion of a workflow in the form of a DAG to DAGMan meta-scheduler instances running locally at each Grid site. DAGMan resolves the dependencies between jobs and accordingly submits them to Condor, the underlying batch scheduler system.

DataCutter [2] is employed for pipelined dataflow execution of portions of a workflow mapped to a Grid site consisting of cluster-style systems. A task mapped and scheduled for execution by Condor on a set of resources may correspond to a meta-component. In that case, the execution of the meta-component is carried out by DataCutter in order to enable the combined use of task- and data-parallelism and data streaming among components of the meta-component. DataCutter uses the filter-stream programming model, where component execution is broken down into a set of filters that communicate and exchange data via a stream abstraction. For each component, the analysis logic (expressed using high-level languages like C++, Java, Matlab and Python) is embedded into one or more filters in DataCutter. Each filter executes within a separate thread, allowing for CPU, I/O and communication overlap. Multiple copies of a filter can be created for data parallelism within a component. DataCutter performs all steps necessary to instantiate filters on the target nodes and invokes each filter's analysis logic. A stream denotes a unidirectional data flow from one filter (i.e., the producer) to another (i.e., the consumer). Data exchange among filters on the same node is accomplished via pointer hand-off while message passing is used for filters on different nodes. In our framework, we employ a version of DataCutter that uses MPI for communication to exploit high-speed interconnect technology on clusters that support them.
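The filter-stream model can be approximated in a few lines (a Python threading analogy, not DataCutter itself; the filter and stream objects here are simplified stand-ins): each filter runs in its own thread, a stream is a bounded queue between a producer and a consumer, and a sentinel value marks the end of the stream.

    import threading, queue

    SENTINEL = None  # end-of-stream marker

    def filter_thread(work, in_stream, out_stream):
        # one filter: read from the upstream queue, apply the analysis logic,
        # write to the downstream queue, and forward the end-of-stream marker
        while True:
            item = in_stream.get()
            if item is SENTINEL:
                out_stream.put(SENTINEL)
                break
            out_stream.put(work(item))

    def run_pipeline(items, stages):
        streams = [queue.Queue(maxsize=8) for _ in range(len(stages))]
        sink = queue.Queue()
        threads = []
        for i, work in enumerate(stages):
            out_s = streams[i + 1] if i + 1 < len(stages) else sink
            t = threading.Thread(target=filter_thread, args=(work, streams[i], out_s))
            t.start()
            threads.append(t)
        for item in items:
            streams[0].put(item)
        streams[0].put(SENTINEL)
        for t in threads:
            t.join()
        results = []
        while not sink.empty():
            item = sink.get()
            if item is not SENTINEL:
                results.append(item)
        return results

    # two-stage pipeline over a handful of chunk ids
    print(run_pipeline(range(5), [lambda x: x * 2, lambda x: x + 1]))  # [1, 3, 5, 7, 9]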

ECO compiler Within the execution of a single component task, the Empirical Compilation and Optimization (ECO) compiler can be employed to achieve targeted architecture-specific performance optimizations. ECO uses model-guided empirical optimization [5] to automatically tune the fine-grain computational logic (where applicable) for multiple levels of the memory hierarchy and multi-core processors on a target compute resource. The models and heuristics employed in ECO limit the search space, and the empirical results provide the most accurate information to the compiler to tune performance parameter values.

5.3 Trade-off module (TM)

When large datasets are analyzed using complex operations, an analysis workflow may take too long to execute. In such cases, users may be willing to accept lower quality output for reduced execution time, especially when there are constraints on resource availability. The user may, however, desire that a certain application-level quality of service (QoS) be met. Examples of QoS requirements in image analysis include "Maximize the average confidence in classification of image tiles within t time units" and "Maximize the number of image tiles, for which the confidence in classification exceeds the user-defined threshold, within t units of time" [17]. We have investigated techniques which dynamically order the processing of data elements to speed up application execution while meeting user-defined QoS requirements on the accuracy of analysis. The Trade-off Module (TM) draws from and implements the runtime support for these techniques so that accuracy of analysis can be traded for improved performance.

We provide generic support for reordering the data processing operations in our framework by extending Condor's job scheduling component. When a batch of tasks (such as those produced by expansion of component collections in a Wings workflow instance) is submitted to Condor, it uses a default FIFO ordering of task execution. Condor allows users to set the relative priorities of jobs in the submission queue. However, only a limited range (−20 to +20) of priorities is supported by the condor_prio utility, while a typical batch could contain tasks corresponding to thousands of data chunks. Moreover, condor_prio does not prevent some tasks from being submitted to the queue in the first place. In our framework, we override Condor's default job scheduler by invoking a customized scheduling algorithm that executes as a regular job within Condor's "scheduler universe" and does not require any super-user privileges. The scheduling algorithm implements a priority queue-based job reordering scheme [17] in a manner that is not tied to any particular application. It uses the semantic representations of data chunks in order to map jobs to the spatial coordinates of the chunks. When the custom scheduler decides which chunk to process next based on its spatial coordinates and other spatial metadata properties, it uses this association to determine the job corresponding to this chunk and moves it to the top of the queue. The custom scheduler can be employed for any application within the spatial data analysis domain. The priority queue insertion scheme can be manipulated for different QoS requirements such that jobs corresponding to the favorable data chunks are scheduled for execution ahead of other jobs. In this way, the customized scheduler helps exercise control over the processing order parameter. When there are no QoS requirements associated with the user query, our framework reverts to the default job scheduler within Condor.
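A simplified sketch of this reordering idea is shown below (hypothetical data structures; the actual implementation runs as a job in Condor's scheduler universe): jobs are keyed by the spatial coordinates of the chunk they process, and whichever chunk the QoS heuristic favors next determines which job is moved to the head of the queue.

    from collections import deque

    # Illustrative sketch: reorder a FIFO job queue so that the job processing
    # the currently favored chunk runs next. job_for_chunk maps chunk spatial
    # coordinates to job ids; choose_next_chunk is a hypothetical QoS heuristic.
    def reorder_queue(job_queue, job_for_chunk, choose_next_chunk):
        chunk = choose_next_chunk(job_for_chunk.keys())
        job = job_for_chunk[chunk]
        if job in job_queue:
            job_queue.remove(job)        # pull the job out of FIFO order
            job_queue.appendleft(job)    # and move it to the head of the queue
        return job_queue

    # Example: favor the chunk closest to a query region centered at (0, 0)
    jobs = deque(["job-a", "job-b", "job-c"])
    mapping = {(0, 0): "job-c", (1, 0): "job-a", (0, 1): "job-b"}
    nearest = lambda coords: min(coords, key=lambda c: c[0] ** 2 + c[1] ** 2)
    print(reorder_queue(jobs, mapping, nearest))  # deque(['job-c', 'job-a', 'job-b'])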

5.4 Framework application

Our current implementation of the proposed framework supports the performance optimization requirements associated with chunk-based image/spatial data analysis applications. However, the framework can be employed in other data analysis domains. The customized Condor scheduling module facilitates a mechanism for trading analysis accuracy for performance with user-defined quality of service requirements. The current implementation instance of our framework provides support for users to specify and express the values of various performance parameters to improve performance of the workflow. When dealing with application-specific performance parameters at a fine-grain computation level, the model-guided optimization techniques in ECO can assist the user in determining the optimal parameter values. We view this as a first step towards an implementation that can automatically map user queries to appropriate parameter value settings. We target application scenarios where a given workflow is employed to process a large number of data elements (or a large dataset that can be partitioned into a set of data elements). In such cases, a subset of those data elements could be used to search for suitable parameter values (by applying sampling techniques to the parameter space) during workflow execution, subsequently refining the choice of parameter values based on feedback obtained from previous runs. Statistical modelling techniques similar to those used in the Network Weather Service [25] can be used to predict performance and quality of future runs based on information gathered in previous runs.
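A first approximation of that idea is sketched below (hypothetical interfaces; in practice this would be combined with the modeling techniques cited above): a handful of candidate values for a parameter are timed on a small sample of data elements, and the best-performing value is then used for the remainder of the dataset.

    import random, time

    # Illustrative sketch of sampling-based parameter selection. run_workflow is
    # a hypothetical callable that processes one data element with a given
    # parameter value; the sample is assumed to be representative of full runs.
    def pick_parameter_value(candidates, data_elements, run_workflow, sample_size=3):
        sample = random.sample(data_elements, min(sample_size, len(data_elements)))
        timings = {}
        for value in candidates:
            start = time.perf_counter()
            for element in sample:
                run_workflow(element, value)
            timings[value] = time.perf_counter() - start
        return min(timings, key=timings.get)   # fastest candidate on the sample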

6 Experimental evaluation

In this section, we present an experimental evaluation of our proposed framework using the two real-world applications, PIQ and NC, described in Sect. 3. Our evaluation was carried out across two heterogeneous Linux clusters hosted at different locations at the Ohio State University. The first one (referred to here as RII-MEMORY) consists of 64 dual-processor nodes equipped with 2.4 GHz AMD Opteron processors and 8 GB of memory, interconnected by a Gigabit Ethernet network. The storage system consists of 2 × 250 GB SATA disks installed locally on each compute node, joined into a 437 GB RAID0 volume. The second cluster (referred to here as RII-COMPUTE) is a 32-node cluster consisting of faster dual-processor 3.6 GHz Intel Xeon nodes, each with 2 GB of memory and only 10 GB of local disk space. This cluster is equipped with both an InfiniBand interconnect and a Gigabit Ethernet network. The RII-MEMORY and RII-COMPUTE clusters are connected by a 10-Gigabit wide-area network connection—each node is connected to the network via a Gigabit card; we observed about 8 Gigabits/sec application-level aggregate bandwidth between the two clusters. The head-node of the RII-MEMORY cluster also served as the master node of a Condor pool that spanned all nodes across both clusters. A Condor scheduler instance running on the head-node functioned both as an opportunistic scheduler (for "vanilla universe" jobs) and a dedicated scheduler (for parallel jobs). The "scheduler universe" jobs in Condor, including our customized scheduling algorithm, when applicable, run on the master node. All other nodes of the Condor pool were configured as worker nodes that wait for jobs from the master. DataCutter instances executing on the RII-COMPUTE cluster use the MVAPICH flavor4 of MPI for communication to exploit the InfiniBand interconnect. Our evaluation of the ECO compiler's automated parameter tuning was carried out independently on a Linux cluster hosted at the University of Southern California. In this cluster (referred to as HPCC) we used a set of 3.2 GHz dual Intel Xeon nodes, each with 2 GB of memory, for our evaluation.

In the following experiments, we evaluate the performance impact of a set of parameter choices on workflow execution time. First, a set of accuracy-preserving parameters are explored, as we would initially like to tune the performance without modifying the results of the computation. We subsequently investigate the accuracy-trading parameters.

6.1 Accuracy-preserving parameters

We used our framework to evaluate the effects of three different accuracy-preserving parameters on the execution time for the PIQ workflow—(i) the chunksize, a component-level parameter, (ii) the task granularity, a workflow-level parameter, and (iii) numActiveChunks, a component-level parameter that is specific to the warp component. Two additional accuracy-preserving parameters—component config and component placement—were set to their optimal values based on our prior experience with and evaluations of the PIQ workflow. In an earlier work [15], our evaluations revealed that certain components of the PIQ workflow such as normalize and autoalign were much faster when mapped for execution onto cluster machines equipped with fast processors and high-speed interconnects. Based on these evaluations and our knowledge of the data exchange between components in the workflow, we determined an optimal component placement strategy which minimized overall computation time as well as the volume of data exchange between nodes. In this strategy, the components zproject, prenormalize, stitch, reorganize, warp and the preprocess meta-component execute on the RII-MEMORY cluster nodes, while the normalize, autoalign and mst components are mapped to the faster processors of the RII-COMPUTE cluster. This component placement strategy allows for maximum overlap between computation and data communication between sites for the PIQ workflow. In another earlier work [16], we designed multiple algorithmic variants for the warp and preprocess components of the PIQ workflow. Our evaluations helped us select the most performance-effective variant for these components depending upon the resources they are mapped to. In our experiments, we fixed the values of these latter two parameters and then set out to tune or determine optimal values for the remaining parameters. Our strategy was to treat different parameters independently, selecting a default value for one parameter while exploring the other. The decision as to which parameter to explore first is one that can either be made by an application developer, or can be evaluated systematically by a set of measurements, such as the sensitivity analysis found in [9].

4 http://mvapich.cse.ohio-state.edu.

Effects of chunksize For these experiments, we used a 5 GB image with 8 focal planes, with each plane at a resolution of 15,360 × 14,400 pixels. The chunksize parameter determines the unrolling factor for component collections in the template shown in Fig. 4(a). Varying the chunksize parameter value affects the structure of the resulting workflow instance and the number of workflow tasks to be scheduled, as shown in Table 1.

Table 1 Number of tasks in PIQ workflow instance for a given chunksize parameter value

  chunksize (pixels)   # of chunks in a plane   # of tasks (no clustering)   # of tasks (horizontal clustering::32)
  512 × 480                     900                      2708                           95
  1024 × 960                    225                       683                           32
  1536 × 1440                   100                       308                           20
  2560 × 2400                    36                       116                           14
  3072 × 2880                    25                        83                            9
  5120 × 4800                     9                        35                            9

The disparity among the number of tasks in the resulting workflow instances will increase as the images grow in size, because larger images can accommodate more combinations of the chunk dimensions. If workflow templates have a larger number of component collections, then the number of tasks in the resulting workflow instance will vary more greatly with the chunksize value. As job submission and job scheduling overheads are sizeable contributions to the overall execution time, one possible optimization technique is to employ horizontal clustering of tasks from every component collection. The table also shows the number of tasks in the resulting PIQ workflow instance for each chunksize value when horizontal clustering by Pegasus is used to group tasks from component collections into bundles of 32 tasks each. Ideally, with horizontal clustering, one should expect diminishing job scheduling overheads and lesser disparity in the execution times observed at different chunksize values.

Figure 7(a) shows the overall execution time for the PIQ workflow instance when choosing different chunksize parameter values, both with and without using horizontal job clustering (we used 16 RII-MEMORY nodes and 8 RII-COMPUTE nodes for these experiments). This time is inclusive of the time to stage data in and out of each site during execution, the actual execution times for each component, and the job submission and scheduling overheads. We observe that: (1) At both extremes of the chunksize parameter value range, the execution times are high. The job submission and scheduling overheads for the large number of tasks dominate at the smallest chunksize value. At the largest chunksize value, the number of resulting chunks in an image becomes smaller than the number of worker nodes available for execution. Since each task must process at least one chunk, the under-utilization of resources leads to higher execution times. (2) The intermediate values for chunksize yield more or less similar performance, except for an unexpected spike at chunksize = 2560 × 2400. On further investigation of the individual component execution times, we observed that the algorithmic variant used for the warp component—which accounts for nearly 50% of the overall execution time of PIQ—performed poorly at this chunksize value. Figures 7(b) and 7(c) show how the chunksize value affects performance at the level of each component. (3) Horizontal job clustering, as expected, lowers execution time at smaller chunksize values. At larger chunksize values, the lack of a large job pool to begin with renders job clustering ineffective.

As mentioned earlier, the chunksize value is expected to have a greater impact on the performance for larger images because of the larger range within which the parameter values can be varied. Figure 7(d) presents our evaluation results obtained for a 17 GB image (with 3 focal planes, each plane having 36,864 × 48,000 pixels). We used 32 RII-MEMORY nodes and 8 RII-COMPUTE nodes for these experiments. The chunksize parameter has been shown using separate axes for the chunk width and the chunk height for better viewing of the results. The surface was plotted based on results obtained using 75 different values of the chunksize parameter. Again, we observe poor performance at the extreme low and high values of chunksize for the same reasons outlined earlier. Here, we also observed that chunksize values that correspond to long horizontal stripe chunks yielded better performance for the PIQ workflow and this class of data. This tells us that most analysis operations in the PIQ workflow favor horizontal striped chunks. In future endeavors, our framework will seek to determine the best values of the chunksize parameter by efficient navigation of the surface based on sampling techniques and trends generated from training data.

Effects of task granularity In these experiments, we coalesced components of the PIQ workflow into meta-components to produce a workflow template with a coarser task granularity. Figure 8 illustrates an alternative workflow template for the PIQ application generated using Wings, that groups components into meta-components. Here, the zproject and prenormalize steps from the original template are fused to form metacomp1; normalize, autoalign, mst are collectively metacomp2, while stitch, reorganize, warp form metacomp3. Preprocess is the fourth meta-component in the workflow template and includes the thresholding, tessellation and prefix sum generation components. By using this representation, our framework further reduces the number of tasks in a workflow instance. When component collections are embedded within a meta-component, they are not explicitly unrolled at the time of workflow instance generation. Instead, they are implicitly unrolled at runtime within a corresponding DataCutter instance. That is, the chunksize input parameter to a meta-component is propagated to the DataCutter filters set up for each component within that meta-component. Based on the value of chunksize, DataCutter will create multiple transparent copies of filters that handle the processing of tasks within a component collection. Each such filter copy or instance will operate on a single chunk at a time. We also manipulated the execution strategy within the preprocess meta-component to support task-parallelism (i.e., pipelined execution of data chunks) in order to avoid disk I/O overheads for the meta-component.

Figure 9 shows that the overall execution time for the PIQworkflow improves by over 50% when the task granular-ity strategy employed includes meta-components. The fig-ure also shows the changes in execution times for individualcomponents of the workflow. For parallel components likeautoalign and warp, there is no difference in performance


Fig. 7 Effect of varying chunksize parameter on PIQ workflow execution


Fig. 8 Wings workflow template for the PIQ application with meta-components

Fig. 9 Execution time with different task granularity (5 GB image, 40 nodes, chunksize = 512 × 480)

because the actual scheduling of processes within a component is carried out in the same way regardless of whether meta-components are used. However, component collections like zproject, normalize and reorganize benefit from the coarser-granularity execution. This difference is attributed to the two contrasting styles of execution that our framework can offer by integrating Pegasus and DataCutter. Pegasus supports batch-style execution where the execution of each

task is treated independently. If startup costs (e.g., MATLAB invocation or JVM startup) are involved in the execution of a task within a component collection, such overheads are incurred for each and every task in the collection. In contrast, DataCutter functions like a service-based system, where filters set up on each worker node perform the desired startup operations only once and then process multiple data instances. The input data to a collection can thus be streamed through these filters. Depending on the nature of the tasks within a collection, our framework can use explicit unrolling and execution via Pegasus, or implicit unrolling within a meta-component and execution via DataCutter.
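A toy cost comparison (with invented startup and per-chunk times, viewed from a single worker) shows why streaming pays off for collections with per-task startup costs: batch-style execution pays the startup cost once per task, whereas a filter-based service pays it once and streams chunks through.

def batch_cost(num_chunks, startup_s, per_chunk_s):
    """Batch-style execution: every task pays the startup cost."""
    return num_chunks * (startup_s + per_chunk_s)

def streaming_cost(num_chunks, startup_s, per_chunk_s):
    """Filter/service-style execution: startup is paid once, chunks are streamed."""
    return startup_s + num_chunks * per_chunk_s

n, startup, work = 7200, 5.0, 2.0                     # invented values for illustration
print("batch:", batch_cost(n, startup, work), "s")
print("streaming:", streaming_cost(n, startup, work), "s")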

Effects of parameter tuning for individual components In these experiments, we take a closer look at accuracy-preserving parameters that are specific to individual components in the workflow. Our goal here is to show how the auto-tuning of such parameters, based on model-guided optimization techniques within the ECO compiler, can reduce the execution times of individual components and hence of the overall workflow. In the PIQ workflow, one of the algorithmic variants of the warp component, known as the On-Demand Mapper (ODM), was shown to provide the best performance among all variants as long as the amount of physical memory available on the cluster nodes was not a limiting factor in the execution [16]. The benefit of ODM derives from the fact that, unlike the other variants, it seeks to maintain large amounts of data in memory during execution. We identified a parameter called numActiveChunks that affects the execution time of the ODM variant by limiting the number of 'active' data chunks that ODM can maintain in memory on a node at any given time. The larger the number of active chunks that can be maintained in memory, the greater the potential for data reuse when the warping transformations are computed for the input data. The potential reduction in chunk reads and writes via data reuse leads to lower execution time. The maximum value of the numActiveChunks parameter depends on the size of the chunks and the available physical memory on a node. However, the optimal numActiveChunks value may depend on other factors, such as the potential for data reuse in the ODM variant for a given memory hierarchy, which need to be modeled appropriately.
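The role of the numActiveChunks limit can be pictured as a bounded pool of in-memory chunks. The Python sketch below is only an illustration of the idea (the ActiveChunkCache class, the load_chunk callback, and the LRU eviction policy are our assumptions, not ODM's actual implementation); cache hits stand in for reuse that avoids chunk reads and writes.

from collections import OrderedDict

class ActiveChunkCache:
    """Bounded pool of in-memory chunks with least-recently-used eviction."""
    def __init__(self, num_active_chunks):
        self.capacity = num_active_chunks
        self.cache = OrderedDict()        # chunk_id -> chunk data, in LRU order
        self.hits = self.misses = 0

    def get(self, chunk_id, load_chunk):
        if chunk_id in self.cache:
            self.cache.move_to_end(chunk_id)          # mark as most recently used
            self.hits += 1
            return self.cache[chunk_id]
        self.misses += 1                              # stands in for a chunk read
        data = load_chunk(chunk_id)
        self.cache[chunk_id] = data
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)            # evict the least recently used chunk
        return data

cache = ActiveChunkCache(num_active_chunks=74)
for cid in [1, 2, 1, 3, 1, 2]:
    cache.get(cid, load_chunk=lambda i: b"")
print(cache.hits, "hits,", cache.misses, "misses")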

We introduced a novel technique for modeling that discovers the behavior of such application-level parameters through the use of functional models and statistical methods in [20]. Our technique requires the component developer to identify a set of integer-valued parameters, and also to specify the expected range for the optimal parameter value. The models are derived from sampling the execution time for a small number of parameter values, and evaluating how well these sample points match a functional model. To derive the functional model of the sample data, we use the curve fitting


toolbox in MATLAB to find the function that best fits the data, and also to compute the R² value in order to quantify the accuracy of our model. Ultimately, we discovered that the double-exponential function (y = a*e^(b*x) + c*e^(d*x)) was the best overall match for the application-level parameters in our set of experiments. Using this double-exponential function model, the algorithm can select the parameter value for which the function is minimized, and can dynamically identify the neighborhood of the best result.
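The models above were built with MATLAB's curve fitting toolbox; the sketch below reproduces the same steps with SciPy purely as an illustration, using invented sample timings: fit the double-exponential model to a handful of (parameter value, execution time) samples, compute R² as a fit-quality check, and pick the parameter value that minimizes the fitted function.

import numpy as np
from scipy.optimize import curve_fit

def double_exp(x, a, b, c, d):
    # y = a*e^(b*x) + c*e^(d*x), the functional model described above
    return a * np.exp(b * x) + c * np.exp(d * x)

# Five sampled (numActiveChunks, execution time) points -- placeholder values.
x = np.array([10.0, 30.0, 50.0, 70.0, 90.0])
y = np.array([310.0, 240.0, 205.0, 190.0, 195.0])

params, _ = curve_fit(double_exp, x, y, p0=(400.0, -0.05, 150.0, 0.001), maxfev=10000)

# Goodness of fit (R^2) over the sampled points.
residuals = y - double_exp(x, *params)
r2 = 1.0 - np.sum(residuals**2) / np.sum((y - y.mean())**2)

# Choose the integer parameter value that minimizes the fitted model.
grid = np.arange(1, 129)
best = int(grid[np.argmin(double_exp(grid, *params))])
print("R^2 = %.3f, predicted best numActiveChunks = %d" % (r2, best))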

Our experiments were conducted on nodes of the HPCC cluster. Here, we evaluated only the warp component of the PIQ workflow; specifically, we evaluated only the ODM variant of the component in order to determine the optimal value of the numActiveChunks parameter. Figure 10 shows results of our evaluation from warping a single slice of the image data described earlier in this section. We model the behavior of the numActiveChunks parameter for the input data using a statistical analysis approach. In this figure, the curve represents the double-exponential function obtained using five sample data points, while the remaining points, obtained from an exhaustive search within the range of permissible parameter values, are also shown. The area between the dotted lines represents the range of parameter values that our model predicts will have performance within 2% of the best performance. In this case, the function model predicted an optimal value of 74 for the numActiveChunks parameter, which turned out to be within 1.25% of the best performance obtainable through exhaustive search.

We validated our model by comparing against performance results from an exhaustive set of experiments across the entire permissible parameter range. Overall, we achieved speedups of up to 1.38X compared to the worst-case performance from the exhaustive search, while remaining within 0.57% to 1.72% of the best performance from the exhaustive search. We examined only 5% of the parameter search space by computing five point samples.

Fig. 10 Autotuning to obtain the optimal value of the numActiveChunks parameter for the warp component; 8 processors

Our experimental results also show speedups of up to 1.30X compared to the user-specified default parameter value. While our speedup is small when placed in the context of the overall PIQ workflow, our selection of parameter values based on a statistical analysis approach yields improved performance compared to the current user-specified values.

6.2 Accuracy-trading parameters

These experiments demonstrate the performance gains obtained when one or more accuracy-trading parameters are modified in response to queries with QoS requirements. The optimizations in these experiments derive from the custom scheduling approach described in Sect. 5.3, which makes data chunk reordering decisions dynamically as jobs complete. In all our experiments, we observed that the overall execution time using our custom scheduling algorithm within Condor is only marginally higher than that obtained using the default scheduler, showing that our approach introduces only negligible overhead. These experiments were carried out on the RII-MEMORY cluster⁵ with Condor's master node running our customized scheduler. We carried out evaluations using multiple images (ranging in size from 12 GB to 21 GB) that are characterized by differences in their data (feature) content. We target user queries with two kinds of QoS requirements. Requirement 1: maximize the average confidence across all chunks in an image within time t. Requirement 2: given a confidence-in-classification threshold, maximize the number of finalized chunks for an image within time t. These requirements can be met by tuning combinations of one or more relevant accuracy-trading parameters.

Tuning only the processing order parameter for requirement 1 This is useful when users wish to maximize the average confidence across an image while processing all chunks at the highest resolution, i.e., trading the quality of the result for the overall image (since not all chunks may be processed within the time limit) but not the quality of the result for individual chunks. In such cases, the processing order parameter can be tuned so that the customized Condor scheduler prioritizes chunks that are likely to yield output with a higher (confidence/time) value at the maximum resolution.
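A minimal sketch of this prioritization rule follows (the ChunkJob fields and the confidence/time estimates are assumptions; the real implementation lives inside our customized Condor scheduler): chunk jobs are ordered by estimated confidence gained per unit processing time at maximum resolution.

from dataclasses import dataclass

@dataclass
class ChunkJob:
    chunk_id: int
    est_confidence: float   # predicted confidence at maximum resolution
    est_time_s: float       # predicted processing time in seconds

def processing_order(jobs):
    """Highest predicted confidence-per-second first."""
    return sorted(jobs, key=lambda j: j.est_confidence / j.est_time_s, reverse=True)

jobs = [ChunkJob(0, 0.90, 30.0), ChunkJob(1, 0.40, 5.0), ChunkJob(2, 0.70, 20.0)]
print([j.chunk_id for j in processing_order(jobs)])   # chunk 1 first, then 2, then 0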

Figure 11 shows that a tuned parameter helps achieve a higher average confidence at maximum resolution across all chunks. For example, after 800 seconds, the tuned-parameter execution achieves an average confidence of 0.34, which is greater than the 0.28 value achieved when no parameter

⁵ Our goal here is only to demonstrate our framework's ability to exploit performance-quality trade-offs, and not adaptability to heterogeneous sets of resources. We note that this experiment could also be carried out on the same testbed used for the PIQ application.


Fig. 11 Quality improvement by tuning only the processing order parameter

Fig. 12 Quality improvement by tuning the processing order and resolution parameters

tuning is employed. This is because our customized scheduler tunes the processing order parameter to reorder chunk execution in a manner that favors jobs corresponding to chunks that yield higher confidence.

Tuning both the resolution and processing order parameters for requirement 1 This is useful when individual chunks can be processed at lower resolutions so long as the resulting confidence exceeds a user-specified threshold. Figure 12 shows how parameter tuning helps achieve higher average confidence at all points during the execution. Each chunk is iteratively processed only until that target resolution at which its confidence exceeds a threshold (set to 0.25 here). Our custom scheduler prioritizes chunks that are likely to yield output with a higher (confidence/time) value at lower resolutions.
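The per-chunk loop can be sketched as follows (classify_at is a placeholder for the classification component; the resolution list and toy classifier are assumptions): each chunk is processed at progressively higher resolutions and is finalized at the first resolution whose confidence clears the threshold.

def process_until_confident(chunk_id, resolutions, threshold, classify_at):
    """Return the (resolution, confidence) at which the chunk is finalized."""
    confidence = 0.0
    for res in resolutions:                       # coarsest to finest
        confidence = classify_at(chunk_id, res)
        if confidence >= threshold:
            return res, confidence                # finalized early, at a lower resolution
    return resolutions[-1], confidence            # fall back to the finest resolution

# Toy classifier whose confidence grows with resolution (for illustration only).
fake_classify = lambda cid, res: min(1.0, 0.1 + res / 2048.0)
print(process_until_confident(7, [256, 512, 1024, 2048], 0.25, fake_classify))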

Results obtained for other test images exhibited similar improvement trends and also showed that our customized scheduling extensions to Condor scale well with data size.

Fig. 13 Improvement in number of finalized tiles by tuning parameters

Tuning both the resolution and processing order parameters for requirement 2 Here, the customized scheduler prioritizes jobs corresponding to chunks that are likely to be finalized at lower resolutions. Figure 13 shows how parameter tuning in our framework yields an increased number of finalized chunks at every point in time during the execution. The improvement for this particular case appears very slight because the confidence-in-classification threshold was set relatively high compared to the average confidence values generated by the classify component, which gave our custom scheduler fewer optimization opportunities via reordering.

Scalability In this set of experiments (carried out as part of requirement 2), we scaled the number of worker nodes in our Condor pool from 16 to 48. Our goal here was to determine whether our custom scheduler could function efficiently when more worker nodes are added to the system. Figure 14 shows how the time taken to process a certain number of chunks in an image halves as we double the number of workers. Hence, the scheduler performance scales linearly as an increasing number of resources need to be managed.

7 Summary and conclusions

Many scientific workflow applications are data- and/or compute-intensive. The performance of such applications can be improved by adjusting component-level parameters as well as by applying workflow-level optimizations. In some application scenarios, performance gains can be obtained by sacrificing the accuracy of the analysis, so long as some minimum quality requirements are met. Our work has introduced a framework that integrates a suite of workflow description, mapping and scheduling, and distributed data processing subsystems in order to provide support for parameter-based


Fig. 14 Scalability with number of worker nodes

performance optimizations along multiple dimensions of the parameter space.

Our current implementation of the proposed framework provides support for users to manually express the values of the various performance parameters in order to improve performance. We have customized the job scheduling module of the Condor system to enable trade-offs between accuracy and performance in our target applications. The experimental evaluation of the proposed framework shows that adjustments of accuracy-preserving and accuracy-trading parameters lead to performance gains in two real applications. The framework also achieves improved responses to queries involving quality-of-service requirements. As future work, we will incorporate techniques to search the parameter space in a more automated manner. We target application scenarios in which a workflow processes a large number of data elements, or a large dataset that can be partitioned into a number of chunks. In such cases, a subset of the data elements could be used to search for suitable parameter values during workflow execution, with the choice of parameter values subsequently refined based on feedback obtained from previous runs.

Acknowledgements This research was supported in part by the National Science Foundation under Grants #CNS-0403342, #CNS-0426241, #CSR-0509517, #CSR-0615412, #ANI-0330612, #CCF-0342615, #CNS-0203846, #ACI-b0130437, #CNS-0615155, #CNS-0406386, and Ohio Board of Regents BRTTC #BRTT02-0003 and AGMT TECH-04049, and NIH grants #R24 HL085343, #R01 LM009239, and #79077CBS10.

References

1. Acher, M., Collet, P., Lahire, P.: Issues in managing variability of medical imaging grid services. In: MICCAI-Grid Workshop (MICCAI-Grid) (2008)

2. Beynon, M.D., Kurc, T., Catalyurek, U., Chang, C., Sussman, A., Saltz, J.: Distributed processing of very large datasets with DataCutter. Parallel Comput. 27(11), 1457–1478 (2001)

3. Brandic, I., Pllana, S., Benkner, S.: Specification, planning, and execution of QoS-aware Grid workflows within the Amadeus environment. Concurr. Comput. Pract. Exp. 20(4), 331–345 (2008)

4. Chang, F., Karamcheti, V.: Automatic configuration and run-time adaptation of distributed applications. In: High Performance Distributed Computing, pp. 11–20 (2000)

5. Chen, C., Chame, J., Hall, M.W.: Combining models and guided empirical search to optimize for multiple levels of the memory hierarchy. In: International Symposium on Code Generation and Optimization (2005)

6. Chiu, D., Deshpande, S., Agrawal, G., Li, R.: Cost and accuracy sensitive dynamic workflow composition over grid environments. In: 9th IEEE/ACM International Conference on Grid Computing, pp. 9–16 (2008)

7. Chow, S.K., Hakozaki, H., Price, D.L., MacLean, N.A.B., Deerinck, T.J., Bouwer, J.C., Martone, M.E., Peltier, S.T., Ellisman, M.H.: Automated microscopy system for mosaic acquisition and processing. J. Microsc. 222(2), 76–84 (2006)

8. Chung, I.H., Hollingsworth, J.: A case study using automatic performance tuning for large-scale scientific programs. In: 15th IEEE International Symposium on High Performance Distributed Computing, pp. 45–56 (2006)

9. Chung, I.H., Hollingsworth, J.K.: Using information from prior runs to improve automated tuning systems. In: SC '04: Proceedings of the 2004 ACM/IEEE Conference on Supercomputing, p. 30. IEEE Computer Society, Washington (2004)

10. Cortellessa, V., Marinelli, F., Potena, P.: Automated selection of software components based on cost/reliability tradeoff. In: Software Architecture, Third European Workshop, EWSA 2006. Lecture Notes in Computer Science, vol. 4344. Springer, Berlin (2006)

11. Deelman, E., Blythe, J., Gil, Y., Kesselman, C., Mehta, G., Patil, S., Su, M.H., Vahi, K., Livny, M.: Pegasus: Mapping scientific workflows onto the grid. In: Lecture Notes in Computer Science: Grid Computing, pp. 11–20 (2004)

12. Gil, Y., Ratnakar, V., Deelman, E., Mehta, G., Kim, J.: Wings for Pegasus: Creating large-scale scientific applications using semantic representations of computational workflows. In: Proceedings of the 19th Annual Conference on Innovative Applications of Artificial Intelligence (IAAI) (2007)

13. Glatard, T., Montagnat, J., Pennec, X.: Efficient services composition for grid-enabled data-intensive applications. In: Proceedings of the IEEE International Symposium on High Performance Distributed Computing (HPDC'06), Paris, France, 19 June 2006

14. Kong, J., Sertel, O., Shimada, H., Boyer, K., Saltz, J., Gurcan, M.: Computer-aided grading of neuroblastic differentiation: Multi-resolution and multi-classifier approach. In: IEEE International Conference on Image Processing, ICIP 2007, vol. 5, pp. 525–528 (2007)

15. Kumar, V., Rutt, B., Kurc, T., Catalyurek, U., Pan, T., Chow, S., Lamont, S., Martone, M., Saltz, J.: Large-scale biomedical image analysis in grid environments. IEEE Trans. Inf. Technol. Biomed. 12(2), 154–161 (2008)

16. Kumar, V.S., Rutt, B., Kurc, T., Catalyurek, U., Saltz, J., Chow, S., Lamont, S., Martone, M.: Large image correction and warping in a cluster environment. In: SC '06: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, p. 79. ACM, New York (2006)

17. Kumar, V.S., Narayanan, S., Kurç, T.M., Kong, J., Gurcan, M.N., Saltz, J.H.: Analysis and semantic querying in large biomedical image datasets. IEEE Comput. 41(4), 52–59 (2008)

18. Lera, I., Juiz, C., Puigjaner, R.: Performance-related ontologies and semantic web applications for on-line performance assessment intelligent systems. Sci. Comput. Program. 61(1), 27–37 (2006)


19. Ludäscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., Lee, E.A., Tao, J., Zhao, Y.: Scientific workflow management and the Kepler system: Research articles. Concurr. Comput. Pract. Exp. 18(10), 1039–1065 (2006)

20. Nelson, Y.L.: Model-guided performance tuning for application-level parameters. Ph.D. Dissertation, University of Southern California (2009)

21. Norris, B., Ray, J., Armstrong, R., Mcinnes, L.C., Shende, S.: Computational quality of service for scientific components. In: Proceedings of the International Symposium on Component-based Software Engineering (CBSE7), pp. 264–271. Springer, Berlin (2004)

22. Oinn, T., Addis, M., Ferris, J., Marvin, D., Senger, M., Greenwood, M., Carver, T., Glover, K., Pocock, M.R., Wipat, A., Li, P.: Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 20(17), 3045–3054 (2004)

23. Thain, D., Tannenbaum, T., Livny, M.: Distributed computing in practice: the Condor experience: Research articles. Concurr. Comput. Pract. Exp. 17(2–4), 323–356 (2005)

24. Truong, H.L., Dustdar, S., Fahringer, T.: Performance metrics and ontologies for grid workflows. Future Gener. Comput. Syst. 23(6), 760–772 (2007)

25. Wolski, R., Spring, N., Hayes, J.: The network weather service: a distributed resource performance forecasting service for metacomputing. J. Future Gener. Comput. Syst. 15, 757–768 (1999)

26. Zhou, J., Cooper, K., Yen, I.L.: A rule-based component customization technique for QoS properties. In: Eighth IEEE International Symposium on High Assurance Systems Engineering, pp. 302–303 (2004)

Vijay S. Kumar is a Ph.D. candidate in computer science at The Ohio State University, where he works as a graduate research associate. His research interests include developing techniques and large-scale software systems to assist data-intensive scientific applications, with particular emphasis on scientific workflow execution in distributed computing environments. Vijay received a B.E. in computer science and an M.Sc. in chemistry from the Birla Institute of Technology and Science, Pilani, India in 2003.

Tahsin Kurc received his Ph.D. in computer science from Bilkent University, Turkey, in 1997 and his B.S. in electrical and electronics engineering from Middle East Technical University, Turkey, in 1989. He was a Postdoctoral Research Associate at the University of Maryland from 1997 to 2000. Before coming to Emory, he was a Research Assistant Professor in the Biomedical Informatics Department at the Ohio State University. Dr. Kurc's research areas include high-performance data-intensive computing, runtime systems for efficient storage and processing of very large scientific datasets, domain decomposition methods for unstructured domains, parallel algorithms for scientific and engineering applications, and scientific visualization on high-performance machines.

Varun Ratnakar is a Research Programmer at the USC/Information Sciences Institute. Varun got a Master's degree from USC and a Bachelor's degree from the Delhi College of Engineering. His interests include Computational Workflows, Knowledge Representation, and Linked Data (www.isi.edu/~varunr, [email protected]).

Jihie Kim is a Research Assistant Professor in Computer Science at the University of Southern California and a Computer Scientist at the USC/Information Sciences Institute. Dr. Kim received a Ph.D. from the University of Southern California and master's and bachelor's degrees from Seoul National University. Her current interests include knowledge-based approaches to developing workflow systems, intelligent user interfaces, and pedagogical tools for online discussions (http://www.isi.edu/~jihie, [email protected]).

Gaurang Mehta is a Research Programmer at the USC Information Sciences Institute. He is a member of the Collaborative Computing Group, Advanced Systems Division at the Information Sciences Institute and works on the Pegasus Project (http://pegasus.isi.edu). He received an M.S. in Electrical Engineering from the University of Southern California and a B.E. in Electronics Engineering from Mumbai University, India. He is currently the co-developer of Pegasus and the lead developer of the Ensemble Workflow Manager. Gaurang has been a member of the IEEE for the last 11 years.

Karan Vahi is a Research Programmer at the USC Information Sciences Institute. He is a member of the Advanced Systems Division at the Information Sciences Institute and works on the Pegasus Project. He received an M.S. in Computer Science from the University of Southern California and a B.E. in Computer Engineering from Thapar University, India. He is currently the lead developer on the Pegasus project.


Yoonju Lee Nelson received her Ph.D. in computer science from the University of Southern California in August 2009. Her research interests include compiler optimization and performance tuning on parallel computers. She also received her M.S. in computer science from the University of Southern California in 2001 and her B.S. in computer science from the University of Ulsan, South Korea, in 1994.

P. Sadayappan is a Professor of Computer Science and Engineering at the Ohio State University. His research centers around programming models, compilers and runtime systems for parallel computing, with special emphasis on high-performance scientific computing.

Ewa Deelman is a Research Associate Professor at the USC Computer Science Department and a Project Leader at the USC Information Sciences Institute. Dr. Deelman's research interests include the design and exploration of collaborative, distributed scientific environments, with particular emphasis on workflow management as well as the management of large amounts of data and metadata. At ISI, Dr. Deelman is leading the Pegasus project, which designs and implements workflow mapping techniques for large-scale workflows running in distributed environments. Dr. Deelman received her Ph.D. from Rensselaer Polytechnic Institute in Computer Science in 1997 in the area of parallel discrete event simulation.

Yolanda Gil is Associate Division Director at the Information Sciences Institute of the University of Southern California, and Research Associate Professor in the Computer Science Department. She received her M.S. and Ph.D. degrees in Computer Science from Carnegie Mellon University. Dr. Gil leads a group that conducts research on various aspects of Interactive Knowledge Capture. Her research interests include intelligent user interfaces, knowledge-rich problem solving, scientific and grid computing, and the semantic web. An area of recent interest is large-scale distributed data analysis through knowledge-rich computational workflows. She was elected to the Council of the American Association of Artificial Intelligence (AAAI), and was program co-chair of the AAAI conference in 2006. She serves on the Advisory Committee of the Computer Science and Engineering Directorate of the National Science Foundation.

Mary Hall (Ph.D. 1991, Rice University) joined the School of Computing at the University of Utah as an associate professor in 2008. Her research focuses on compiler-based autotuning technology to exploit performance-enhancing features of a variety of computer architectures, investigating application domains that include biomedical imaging, molecular dynamics and signal processing. Prior to joining the University of Utah, Prof. Hall held positions at the University of Southern California, Caltech, Stanford and Rice University.

Joel Saltz is Director of the Center for Comprehensive Informatics, Professor in the Departments of Pathology, Biostatistics and Bioinformatics, and Mathematics and Computer Science at Emory University, Chief Medical Information Officer at Emory Healthcare, Adjunct Professor of Computational Science and Engineering at Georgia Tech, Georgia Research Alliance Eminent Scholar in Biomedical Informatics, and Georgia Cancer Coalition Distinguished Cancer Scholar. Prior to joining Emory, Dr. Saltz was Professor and Chair of the Department of Biomedical Informatics at The Ohio State University (OSU) and Davis Endowed Chair of Cancer at OSU. He served on the faculty of Johns Hopkins Medical School, University of Maryland College Park and Yale University in departments of Pathology and Computer Science. He received his M.D. and Ph.D. (computer science) degrees at Duke University and is a board certified Clinical Pathologist trained at Johns Hopkins University.