
DATA SCHEDULING FOR LARGE SCALE DISTRIBUTED APPLICATIONS

Mehmet Balman and Tevfik Kosar
Center for Computation & Technology and Department of Computer Science

Louisiana State University, Baton Rouge, LA 70803, USA
[email protected], [email protected]

Keywords: Data-Intensive Applications, Scheduling, Distributed Computing, Grid Environment, Data Placement.

Abstract: Current large scale distributed applications studied by large research communities result in new challenging problems in widely distributed environments. In particular, scientific experiments using geographically separated and heterogeneous resources necessitate transparent access to distributed data and the analysis of huge collections of information. We focus on data-intensive distributed computing and describe a data scheduling approach to manage large scale scientific and commercial applications. We identify parameters affecting data transfer and also analyze different scenarios for possible use cases of data placement tasks to discover key attributes for performance optimization. We are planning to define the crucial factors in data placement in widely distributed systems and to develop a strategy to schedule data transfers according to the characteristics of dynamically changing distributed environments.

1 INTRODUCTION

Current large scale distributed applications studied by large research communities result in new challenging problems in distributed environments. Computationally intensive science (e-Science), which has wide application areas including particle physics, bioinformatics, and social simulations, demands highly distributed networks and deals with immense data sets (Hey and Trefethen, 2003; Baker, 2004).

In particular, scientific experiments using geographically separated and heterogeneous resources necessitate transparent access to distributed data and the analysis of huge collections of information (CMS, 2006b; Atlas, 2006; Holtman, 2001).

There are several studies concentrating on data management in scientific applications (Allcock et al., 2001b; Allcock et al., 2001c; Venugopal et al., 2004); however, resource allocation and job scheduling that consider data requirements still remain an open problem. Data placement is part of the job scheduling dilemma, and it should be considered a crucial factor affecting resource selection and scheduling in distributed computing (Kosar and Livny, 2004; Kosar, 2006; Stork, 2006; Condor, 2006).

In our research, we focus on data-intensive distributed computing and describe a data scheduling approach to manage large scale scientific and commercial applications. We identify structures in the overall model and also analyze different scenarios for possible use cases of data placement tasks to discover the key attributes affecting performance.

We start with data placement operations inside a single host without network interference. We proceed further to discuss the possible effects on data movement either from or to multiple hosts. We extend the idea by examining conditions in a more complex situation where many platforms are connected to each other and various data placement jobs between them need to be scheduled for different computing steps of applications.

The rest of the paper is organized as follows. In Section 2, we present possible problems in current large scale applications and give some motivating remarks. In Section 3, we focus on the data placement challenge and discuss parameters that need to be used during scheduling to optimize data transfer. In Section 4, we analyze data-intensive scheduling and present some recent studies considering data requirements and data scheduling. Finally, we conclude, in Section 5, with future work and the planned methodology to investigate data scheduling in widely distributed environments.

2 BACKGROUND

We define data as an illustration of information and concepts in a formally organized way, to be interpreted and processed in order to accomplish computing tasks. Therefore, computing itself cannot be targeted as the only dilemma without providing the necessary protocols for storing and transferring information.

Rapid progress in distributed computing and Grid technology has enabled collaborative studies and provided essential compute power for science, and it has made data administration a critical challenge in this area. Scientific applications have become as data intensive as business applications; moreover, data management has become more demanding than computational requirements in terms of needed resources (Hey and Trefethen, 2003; DOE, 2006).

Computational science focuses on many areas, such as astronomy, biology, climate, high-energy physics, and nanotechnology. Although applications from different disciplines have different characteristics, their requirements fall into similar fundamental categories in terms of data management. Workflow management, metadata description, efficient access and data movement between distributed storage resources, and visualization are some of the necessities for applications using simulations, experiments, or other forms of information to generate and process data.

The SuperNova project in astronomy is producing terabytes of data per day, and a tremendous increase in the volume of data is expected in the next few years (Aldering and Collaboration, 2002). The LSST (Large Synoptic Survey Telescope) is scanning the sky for transient objects and producing more than ten terabytes of data per simulation (Tyson, 2002). Similarly, simulations in biomolecular computing generate huge data sets to be shared between geographically distributed sites. In climate research, the data from every measurement and simulation exceeds one terabyte. Besides, real-time processing of petabytes of data in experimental high-energy physics (Cern, 2006) remains the main problem affecting the quality of real-time decision making.

Blast (Altschul et al., 1990), a bioinformatics application, is implemented to decode genetic information by searching for similarities in protein and DNA databases to map the genomes of different species. Basically, it takes search sequences and gene databases as input and produces matched sequences as output. All input files are fetched to the execution site by a pre-processing data transfer script. Inside the execution site, multiple processes are started on different compute nodes. After completion of the running processes, useful information is collected from the compute nodes and another script is triggered to transfer the output files. Since simple scripts are used, data transfer is neither optimized nor scheduled. Therefore, we may encounter many problems: the network may be overloaded by many simultaneous transfers, storage space may fill up and cause all transfers to fail, and redundant copies requested by different computational jobs may be transferred to the same execution node.
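The failure modes above can be guarded against with a small amount of placement logic. The following Python sketch is our illustration, not the actual Blast staging scripts: it caps the number of simultaneous transfers, refuses a transfer that would overflow local storage, and skips files another job has already staged. All names, URLs, and paths are hypothetical.

```python
import os
import shutil
import threading
from urllib.request import urlopen   # stand-in for a real transfer tool

MAX_SIMULTANEOUS = 4                  # cap simultaneous transfers on this node
_slots = threading.BoundedSemaphore(MAX_SIMULTANEOUS)
_staged = set()                       # files already requested on this node
_staged_lock = threading.Lock()

def stage_in(url, dest_path, expected_size):
    """Fetch one input file while avoiding the failure modes above."""
    with _staged_lock:
        if dest_path in _staged:      # redundant copy requested by another job
            return dest_path
        _staged.add(dest_path)
    free = shutil.disk_usage(os.path.dirname(dest_path) or ".").free
    if expected_size > free:          # would fill the disk and break all transfers
        raise RuntimeError(f"not enough space for {dest_path}")
    with _slots:                      # limits load on the network and the server
        with urlopen(url) as src, open(dest_path, "wb") as dst:
            shutil.copyfileobj(src, dst)
    return dest_path
```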

CMS (CMS, 2006a) is a high energy physics application consisting of four steps. Each step takes input from the previous one and generates simulated information. The staging-in and staging-out between all steps are accomplished by transfer scripts, so they are neither scheduled nor is storage space allocated before the data movement starts. Data transfers after the execution are not treated as an asynchronous process; they are started immediately, whether or not the following job is scheduled and requesting input.

Although data placement has a great effect on the overall performance of an application, it has been considered a side effect of computation and has been either embedded inside the computation task or managed by simple scripts. Data placement should be efficient and reliable, and it should be planned cautiously in the process flow. We need to consider data placement and data scheduling as a significant part of the distributed structure to be able to get the best performance.

There are also numerous data-intensive applications in business which have complex data workloads. For example, credit card fraud detection systems need to analyze every transaction passing through the transaction manager and detect fraud using models and historical data to calculate an estimated score. Historical data and the delivery of previous information should be managed in such a way that the servers calculating the score can access data effectively. Moreover, many financial institutions provide data mining techniques for brokerage and customer services.

In addition, medical institutions are among the businesses using computational network resources. In particular, large image transfers and data streams between separate sites are major problems in this area. Moreover, oil companies or electronics design companies may have long term batch processes. Therefore, commercial applications have short and long computational jobs, and they also have small transactions and large data movement requirements. Since workload characteristics in business are complex, a solution to the data scheduling and data placement problem will benefit business as well.

Storing real-time data is already challenging in experimental applications. In addition, efficient data transfer and organization, such as metadata systems, are also crucial for simulation based applications, which are usually executed in batch mode. Furthermore, those large-scale challenges in science necessitate joint study between various multi-disciplinary teams using heterogeneous resources. Thus, new approaches in data distribution and networking are necessary to enable the grid in science, as supported by (Johnston et al., 2000; Venugopal et al., 2004).

One technical challenge in science is high capacity, fast storage systems. There has been considerable effort in business to provide the required information systems for commercial applications; however, organizing data for querying in petabyte-scale databases, where relational databases cannot fit properly, still remains an open issue. Previously, there has been some work on special storage tools and metadata models developed for specific scientific applications (Root, 2006; Tierney et al., 1999). But we still need generic standards to integrate distributed resources with an open access mechanism to data.

3 THE DATA PLACEMENT CHALLENGE

There are usually several steps in the overall process flow of large scale applications. One important aspect in distributed systems is managing the interaction between computing tasks. Communication between different steps can be established by the information delivered between them. The data, i.e., the delivered information, which characterizes the dependencies and connections, should be moved to supply input for the next computing processes between different tasks serving different purposes.

We define data placement as a coordinated movement of any information between related steps in the general workflow. Reorganizing and transferring data between separate tasks in a single server, moving data to different storage systems, or transmitting the results of previous tasks to a remote site are all examples of data placement. Data placement is not limited to information transfer over a network; however, one common use case is to transfer data between servers in different domains over a wide-area network. Interestingly, even data transfer inside a single machine is very important and requires effort to get a sufficient result.

We concentrate on several use cases of data placement tasks (as shown in Figure 1) and define important attributes that will affect the overall performance and throughput of data transfer. Figure 1.a represents data transfer in a single execution site where there is no network interference. Figure 1.b corresponds to data placement between two separate sites where data transfer between storage elements is performed over a network. Figure 1.c and Figure 1.d show data placement from or to a single storage element. Figure 1.e illustrates a more complex scenario in which various data placement tasks need to be scheduled between multiple hosts. Each scenario comes with a different set of parameters to be used during the decision making of data placement execution.

3.1 Data Transfer in a Single Host

Processes running on a single machine might have special input formats, necessitating existing data to be converted properly and kept in a different storage space to accomplish the desired performance. Moreover, we may need to move some information to another storage device due to space limitations or storage management constraints. Different features of storage locations, such as availability and reliability, access rate, and the expected quality of service, force the reorganization of data objects. Therefore, some data files may be transferred to another disk storage in a single host due to administrative conditions, performance implications (like different block sizes in file systems), and the requirements of running executables (such as formatting and post-processing).

Movement of whole or partial output within a single domain also presents difficulty in terms of efficiency and performance. Even while copying a single file between two disks, we should deal with very common issues like space reservation, performance requirements, and CPU utilization. Server load, available disk space, file system block size, and the protocols used for accessing low-level I/O devices are some of the factors in this simple example.
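As a small illustration of how much these low-level parameters matter, the following sketch (ours, with hypothetical file names) copies a file using different buffer sizes and reports the achieved throughput; on a real system the best value depends on the file system block size and device characteristics discussed above.

```python
import os
import time

def copy_with_buffer(src, dst, buf_size):
    """Copy src to dst with a fixed read/write buffer; return MB/s."""
    start = time.perf_counter()
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        while True:
            chunk = fin.read(buf_size)
            if not chunk:
                break
            fout.write(chunk)
        fout.flush()
        os.fsync(fout.fileno())      # include device write-back in the timing
    elapsed = time.perf_counter() - start
    return os.path.getsize(src) / elapsed / 1e6

# Probe a few buffer sizes; "input.dat" is a hypothetical test file.
for buf in (4 * 1024, 64 * 1024, 1024 * 1024):
    rate = copy_with_buffer("input.dat", "copy.dat", buf)
    print(f"buffer {buf:>8} bytes: {rate:6.1f} MB/s")
```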

In current enterprise systems, we usually have specialized servers for different purposes, like application, database, and backup servers. The most recent information is kept on fast servers and storage devices; historical data and other log information is then transferred to other systems, such as data warehouses.

Figure 1: Data Placement Scenarios.

Backup operations, such as transferring data from multiple storage disks to fast magnetic tape devices, require multiplexing techniques to maximize utilization and to complete the backup in minimum time. Several blocks from different files are read at the same time and pushed to the backup device in order to parallelize the operation.
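A minimal sketch of this multiplexing idea follows; it is our simplification, with an ordinary file standing in for the tape device. A real backup tool would also record each block's origin and offset so the files can be restored.

```python
import queue
import threading

BLOCK = 256 * 1024
blocks = queue.Queue(maxsize=32)   # bounded: readers stay just ahead of the drive

def reader(path):
    """Read one source file block by block and feed the shared queue."""
    with open(path, "rb") as f:
        while True:
            data = f.read(BLOCK)
            if not data:
                break
            blocks.put((path, data))

def backup(sources, device_path):
    """Interleave blocks from several files onto a single backup device."""
    threads = [threading.Thread(target=reader, args=(p,)) for p in sources]
    for t in threads:
        t.start()
    with open(device_path, "wb") as dev:   # stand-in for a tape device
        while any(t.is_alive() for t in threads) or not blocks.empty():
            try:
                _origin, data = blocks.get(timeout=0.1)
            except queue.Empty:
                continue
            dev.write(data)   # the writer is kept busy by many parallel readers
```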

The problem of data transfer within a single host seems simple, but it actually involves many sophisticated copy and backup tools to improve efficiency. I/O scheduling in current operating systems with multiple storage devices is a good example of data placement in a single system. The OS kernel schedules data placement tasks in a single host by ordering data operations and managing cache usage. We usually have several channels, sometimes with different bandwidths, to access disk devices, and usually more than one processor needs to be selected to execute an I/O operation in a multiprocessor system. In addition, fairness between multiple I/O requests, avoidance of process starvation, accomplishment of real-time requirements, and resolution of deadlock conditions are some other functions of the operating-system kernel in a single system. Consequently, complex queuing techniques for fairness and load balancing between I/O operations, and greedy algorithms for managing cache usage and disk devices (with different access latencies and capacities), have been developed to schedule I/O operations and to resolve the data placement problem in a single host.

Also, an increase in the number of data placement tasks makes the problem more complicated; e.g., a bunch of files which need to be copied to another storage system should achieve high throughput even though there is a time limitation on every file copy. On top of that, we need to manage hardware failures and errors in the transfer protocol.

Tuning the buffer size, using memory cache, partitioning data, and using different channels for multiplexing are some of these factors. Data placement over a network also deals with the same concerns we face in a single host, since in that case we also need to access the file system layer and physical devices.

However, the factors and parameters affecting overall throughput and performance have different impacts to different degrees. We need to first search for the factors which improve performance the most, and then try to optimize parameters only if their contributions are above a threshold. For example, a backup operation from fast disk devices to a slow tape device would not need cache optimization or parallel streams, since the read/write overhead of the slow tape device would be the bottleneck in the overall process. In such a case we do not need to optimize these parameters.

We identify some important factors for data placement in a single host as follows:

• server load, CPU and memory utilization,

• available storage space, and space reservation,

• multiplexing and partitioning techniques,

• buffer size and cache usage,

• file system performance (e.g., block size, etc.),

• protocol and device performance,


• I/O subsystem and scheduling.

Although I/O operations in a single server have many parameters influencing data placement performance and data transfer scheduling, we focus only on server side attributes, like system load and utilization, space availability, and storage reservation, to keep the data placement models in a single host simpler.

3.2 Data Placement between a Pair of Hosts

Data placement between two hosts has many use cases, such as downloading a file or uploading data sets to be used by another application. Data processed in a single step of a large scale application with multiple, separate operational sites could be transferred to other related tasks to continue the operation in the overall workflow.

Some factors affecting efficient data movement are changing network conditions, heterogeneous operating systems and working environments, failures, available resources, and administrative decisions.

Since the network is considered the main bottleneck when sending large amounts of data between a client and a server over wide-area networks, increasing the number of simultaneous TCP connections is one method to gain performance. The data-flow process also includes streaming and the transfer of partial data objects and replicas; however, even a simple architecture built to transfer input/output files remains a difficult concern.

Some file transfer tools like GridFTP (GridFTP, 2006; Allcock et al., 2001a) support multiple TCP streams and configurable buffer sizes. On the other hand, it has been stated that multiple streams do not work well with the TCP congestion-control mechanism (Eggert et al., 2000; Floyd, 2000; Dunigan et al., 2002). Many simultaneous connections between two hosts may result in poor response times, and it has been suggested that the maximum number of simultaneous TCP streams to a single server should be one or two, especially under congested conditions (Floyd, 2000; Allman et al., 2000).

Use of multiple connections has been declared a TCP-unfriendly activity, but outstanding performance results were obtained by employing simultaneous streams in recent experiments (Kola et al., 2004b; Kosar, 2006; Kosar and Livny, 2004). Therefore, the number of simultaneous TCP streams is one of the critical factors for transfers over a network, and most importantly it can easily be employed to gain performance.

Sudden performance decreases are to be expected if the number of concurrent connections goes beyond the limit of the server's capacity. The maximum number of allowable concurrent connections depends on both network and server characteristics, and selecting an appropriate value for each channel that will optimize the transfer in the long term is a challenging issue in networked environments. Moreover, some protocols which are adjusted to utilize the maximum network channel bandwidth may have performance problems if more than one channel is used (Kola et al., 2004a).
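The following sketch illustrates the simultaneous-streams idea using HTTP range requests as a stand-in for a striped GridFTP-style transfer; it assumes a hypothetical server that honors Range headers, and the `streams` parameter is exactly the knob whose safe value the studies above debate.

```python
import concurrent.futures
import urllib.request

def fetch_range(url, start, end):
    """Fetch one byte range over its own TCP connection."""
    req = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(req) as resp:
        return start, resp.read()

def parallel_download(url, size, dest, streams=4):
    """Split [0, size) into `streams` ranges fetched concurrently."""
    step = size // streams
    ranges = [(i * step, size - 1 if i == streams - 1 else (i + 1) * step - 1)
              for i in range(streams)]
    with open(dest, "wb") as out, \
         concurrent.futures.ThreadPoolExecutor(max_workers=streams) as pool:
        futures = [pool.submit(fetch_range, url, s, e) for s, e in ranges]
        for fut in concurrent.futures.as_completed(futures):
            offset, data = fut.result()
            out.seek(offset)          # stitch the ranges back in place
            out.write(data)
```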

There are also various specialized TCP protocols (FastTCP, 2006; sTCP, 2006; Stream, 2006) which change TCP features for large data transfers. The TCP window buffer size may be the most important parameter to tune in order to decrease latency for an efficient transfer. There have been numerous studies on fast file downloading (Plank et al., 2003), and many tools have been developed to measure network metrics (cFlowd, 2006; NetFlow, 2006) and to tune TCP by setting the window size for optimum performance.
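At the socket level, enlarging the kernel send/receive buffers is the usual way to widen the effective TCP window. A minimal sketch, with buffer sizes chosen arbitrarily for illustration (the OS may clamp the requested values):

```python
import socket

def tuned_socket(send_buf=4 * 1024 * 1024, recv_buf=4 * 1024 * 1024):
    """Create a TCP socket with enlarged kernel buffers.

    Larger socket buffers let TCP keep more data in flight on high
    bandwidth-delay-product paths.
    """
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, send_buf)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, recv_buf)
    return s

s = tuned_socket()
print("effective send buffer:",
      s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF))
```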

Bandwidth is the first metric for estimating the duration of a transfer, but efficient use of protocols and network drivers can also serve as an important metric, since it brings significant improvement in terms of latency and utilization. The characteristics of the communication structure determine which action should be taken while tuning the data transfer. Local area networks and the Internet have different topologies, so they demonstrate diverse features in terms of congestion, failure rate, and latency. In addition, dedicated channels such as fiber-optic networks require special management techniques (Stream, 2006).

We also have high-speed network structures with channel reservation strategies for fast and reliable communication. Congestion is not a concern on such a pre-reserved channel, and the transfer protocol should be designed to send as much data as possible for maximum utilization. We need to provide adaptive approaches to understand the underlying network layer and to optimize data movement accordingly.

Application and storage layers should feed the network with enough data in order to obtain high throughput. Network optimization will be worthless if we cannot get an adequate response from the storage device. The buffer size that determines the volume of data read from storage should fit the network buffer used in the transfer protocol. Besides, server capacity and performance also play crucial roles; they determine how much data can be sent or received in a given interval. Therefore, storage systems have been developed for efficient movement of data (Kola et al., 2004a; Kosar et al., 2004) by providing caching mechanisms and multi-threaded processing.

We identify new parameters, which should be considered if data placement is performed over a network, as follows:

• simultaneous TCP connections,

• TCP tuning (send/receive buffer size, TCP window size, MTU, etc.),

• data transmission protocol performance (GridFTP, FTP, etc.).

Consequently, the parameters that can be used in the data placement problem between a pair of hosts can be extended by including attributes related to the network, such as the degree of parallelization with concurrent channels, bandwidth reservation in dedicated networks, and transfer protocol optimization like tuning the TCP window size. On the other hand, we are also bounded by server performance, and we should consider the efficiency, performance, and space availability of servers to manage data placement between a pair of hosts.

3.3 Data Placement from Multiple Servers to a Single Server

Nowadays, scientific and industrial applications need to access terabytes of data, and they need to work with distributed data sets. One reason to keep information in distributed data stores is the large amount of required data. Another concern is reliability, such that replicas of information are stored in different systems apart from each other. Data management also focuses on replication in order to ensure data proximity and reliability. Alternatively, accessing separated data sets at the same time provides performance via parallel communication.

Data transfer from multiple clients to a single host has many diverse examples. Data objects may be stored on separate storage servers due to space limitations or load balancing. Besides, running application modules may require remote access to information generated by remote sites. Thus, a process scheduled at an execution site may need to start multiple data transfer jobs to obtain its input.

One simple example is downloading multiple files or data objects from several sites, where we want to finish the operation in the minimum amount of time. In addition, we may prefer to move the same data objects from different sites to ensure data availability in case of hardware and system failures on remote machines. Often, transferring files one by one is not preferred unless there is a special condition. Starting every download at the same time to get maximum utilization may seem like a good idea; however, the efficiency of the data transfer is limited by server and network capacity.

Parallelization in the network has two methods: one is concurrent connections, discussed in the previous subsection, and the other is parallel channels to every server. Concurrency is obtained by opening multiple TCP streams between a pair of servers. Parallel connections, on the other hand, are maintained by multiple threads serving each data transfer from different sites.

The number of parallel connections from a client to each server is increased in order to get maximum network utilization. Parallelization may have side effects similar to concurrency if the generated load is over the limits of system capacity. Both network properties, such as bandwidth, failure rate, and congestion, and server parameters, like CPU load, memory, and communication protocol, influence parallelization. Therefore, the number of parallel connections to a server, which is set according to the condition of the network environment, is one of the critical parameters we should concentrate on.
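A sketch of such a per-server cap, using Python's standard library and hypothetical replica URLs; PER_SERVER_LIMIT is the parameter that would be set from observed network and server conditions:

```python
import concurrent.futures
import threading
import urllib.request
from urllib.parse import urlparse

PER_SERVER_LIMIT = 2                  # set from observed network/server state
_server_slots = {}
_slots_lock = threading.Lock()

def _slot_for(url):
    host = urlparse(url).netloc
    with _slots_lock:                 # create one semaphore per server
        if host not in _server_slots:
            _server_slots[host] = threading.BoundedSemaphore(PER_SERVER_LIMIT)
    return _server_slots[host]

def fetch(url, dest):
    """Download one file without exceeding PER_SERVER_LIMIT per host."""
    with _slot_for(url):
        urllib.request.urlretrieve(url, dest)
    return dest

# hypothetical replicas spread over two servers
urls = ["http://serverA.example/f1", "http://serverA.example/f2",
        "http://serverB.example/f3"]
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = list(pool.map(fetch, urls,
                            [f"file{i}" for i in range(len(urls))]))
```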

Another example is to upload multiple files from different file servers to the execution site where various tasks are running, such that each task uses separate file sets staged to the execution site. Each task has diverse dependencies on data objects stored on remote clients. The main purpose is to optimize the transfer so that the maximum number of tasks can start execution. To simplify the problem, we can modify the objective and state the optimization as maximizing the number of files arriving in the minimum amount of time.

One simple method for the given example is to deliver small files over high bandwidth connections to obtain the maximum number of transferred files. In order to get the best throughput over the network, selecting large files and closer storage servers is another approach to optimization. In either case, we need an ordering of data placement tasks. Even if we are delivering replicas of a single object distributed across hosts, we still need a decision to minimize the required time.
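Both heuristics reduce to sorting the task list by an estimated cost. A sketch of the first one (shortest estimated completion time first), with hypothetical file sizes and predicted bandwidths:

```python
def order_transfers(tasks):
    """Shortest estimated completion time (size / predicted bandwidth) first,
    which maximizes the number of files delivered early."""
    return sorted(tasks, key=lambda t: t["size"] / t["bandwidth"])

# hypothetical tasks: file size in bytes, predicted bandwidth in bytes/s
tasks = [
    {"name": "seqs.db",   "size": 8e9, "bandwidth": 50e6},
    {"name": "params",    "size": 2e6, "bandwidth": 10e6},
    {"name": "genome.fa", "size": 1e9, "bandwidth": 80e6},
]
for t in order_transfers(tasks):
    print(t["name"], "estimated", round(t["size"] / t["bandwidth"], 2), "s")
```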

Each connection from a server to the client performing the data movement operation might have different characteristics. Bandwidth, latency, and the number of hops a network packet should go through can differ. Thus, data placement jobs need to be scheduled according to conditions in the network and servers.

Moreover, we expect different characteristics between uploading multiple files to a single server and downloading a file to multiple clients. Both cases look similar on the network side, but they differ in terms of the resources they use in the communication protocol. In the first case, the transfer protocol should manage all channels and write data into the storage. In the second case, data should be read and pushed to all channels connected to the server. Read and write overheads in the transfer protocol affect operating cost at different levels, so specific actions are required to exploit the data placement for best performance.

                                  In Single   Between a       Multiple Servers   Between
                                  Host        Pair of Hosts   to Single Server   Distributed Servers
Available Storage Space               ✓             ✓                 ✓                   ✓
CPU Load and Memory Usage             ✓             ✓                 ✓                   ✓
Transfer Protocol Performance                       ✓                 ✓                   ✓
Number of Concurrent Connections                    ✓                 ✓                   ✓
Network Bandwidth and Latency                                         ✓                   ✓
Number of Parallel Streams                                            ✓                   ✓
Ordering of Data Placement Tasks                                                          ✓

Figure 2: Key Attributes affecting Data Placement.

We mainly focus on two additional factors in data placement from multiple servers to a single host:

• parallel network connections,

• network bandwidth and latency.

Data placement from multiple servers to a single server is affected by both server and network conditions. We also need to consider the bandwidth and latency of the network to reach a decision and to order transfer tasks accordingly in order to obtain higher throughput. Furthermore, parallel connections to a server are a fundamental technique for gaining more utilization of the communication channel during the transfer operation.

3.4 Data Placement between Distributed Servers

There have been many studies of high throughput data transfer in which the best possible paths between remote sites are chosen and servers are optimized to gain high data transfer rates (Dykstra, 2006). TCP buffer size, window size, and MTU are some of the attributes used in overall network transfer optimization. Moreover, high speed networks require different strategies to gain the best performance. Parallel and concurrent transfer streams are used for high performance, but unexpected results may occur in highly loaded networks (Dykstra, 2006).

Data servers are distributed over different locations, and the available network, like the Internet, is usually shared; therefore, minimizing the network cost by selecting the path which gives maximum bandwidth and minimum network delay will increase the overall throughput (Allen and Wolski, 2003; NWS, 2006).

In widely distributed network environments, resources such as network bandwidth and computing power are shared by many jobs. Since data placement from multiple clients to a single host already requires ordering of the data placement jobs, data transfer between distributed servers is even more complicated. We need a central scheduling mechanism which can collect information from separate sites and schedule data placement jobs such that maximum throughput in minimum time is achieved.

In more complex, realistic situations, the network is jointly used by several users unless dedicated channels are allocated. It is very hard to obtain the state of the network, especially in a widely distributed system.

Parameters like bandwidth and latency that need to be used during the decision process change dynamically. We can measure the values of attributes like CPU load, memory usage, and available disk space on the server side. On the other hand, we usually rely on predicted values for network features, calculated from previous states.
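A common way to turn past measurements into such predictions is an exponentially weighted moving average. The sketch below is our minimal stand-in for forecasters like the Network Weather Service, with made-up measurements:

```python
class BandwidthPredictor:
    """Exponentially weighted moving average over past measurements.

    Recent observations weigh more, so the forecast tracks changing
    network conditions.
    """
    def __init__(self, alpha=0.3):
        self.alpha = alpha        # higher alpha reacts faster to change
        self.estimate = None

    def observe(self, measured_mbps):
        if self.estimate is None:
            self.estimate = measured_mbps
        else:
            self.estimate = (self.alpha * measured_mbps
                             + (1 - self.alpha) * self.estimate)
        return self.estimate

p = BandwidthPredictor()
for sample in (90.0, 85.0, 40.0, 42.0):   # hypothetical Mb/s measurements
    print(f"measured {sample:5.1f} -> predicted {p.observe(sample):5.1f}")
```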

In order to clarify the data placement scenario, we state a simple example in which multiple files in distributed data centers need to be transferred to a central location, and the results of previous computations need to be uploaded to those data centers. A job which requires data sets from other data objects has been scheduled and has started execution at a computing site, so we need to transfer data sets from data servers to the execution site. We may have other tasks at the execution site trying to send files to these data servers at the same time. We may also need to transfer data to the data server from other, separately located sites.

Since the data movement jobs compete with each other, decision making is required to have them ordered and run concurrently. Without a central scheduling mechanism, saturation of both network and servers is inevitable. We may need to decline an upload operation until data files are downloaded, so that there is available space in storage. Moreover, we might delay a data transfer job if the network is under heavy utilization due to other data placement tasks. On the other hand, decisions on the number of parallel streams and concurrent connections become more difficult when more than one task is transferring data at the same time.

In conclusion, the problem itself is complicated, and it depends on many parameters. Figure 2 summarizes some of the parameters discussed in this section. We analyze the problem starting from the simplest scenario and extend it to more complex situations by stating the critical key features affecting performance. Besides, the scheduler should be able to make rapid choices and dynamically adapt itself to changing situations. Instead of applying composite algorithms, simple scheduling techniques will be efficient enough to achieve at least acceptable results, as sketched below.
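To make this concrete, here is a deliberately simple scheduling sketch in that spirit (our illustration, not an implemented system): jobs are ordered by estimated duration, delayed when the destination lacks space, and dispatched up to a concurrency cap.

```python
def schedule(jobs, free_space, max_concurrent):
    """Select the next batch of data placement jobs to dispatch.

    Jobs are tried in order of estimated duration; a job is delayed when
    the destination lacks space or the concurrency cap is reached,
    mirroring the decline/delay decisions described above.
    """
    ready, delayed = [], []
    for job in sorted(jobs, key=lambda j: j["size"] / j["bandwidth"]):
        if len(ready) >= max_concurrent:
            delayed.append(job)
        elif job["size"] > free_space.get(job["dest"], 0):
            delayed.append(job)                    # retry once space frees up
        else:
            free_space[job["dest"]] -= job["size"] # reserve space up front
            ready.append(job)
    return ready, delayed
```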

4 DATA-INTENSIVE SCHEDULING

Current scientific and commercial applications consist of several stages, mostly processed at different computing sites. Often, each processing step is triggered by the completion of previous phases or by the availability of some resource, like a specific data set, to initiate the execution.

A simple architectural configuration of distributed computing for large scale applications can be evaluated in four phases. First, we need workflow managers to define the dependencies of execution sequences in the application layer. Second, higher level planners are used to select and match appropriate resources. Then, a data-aware scheduler is required to organize requests and schedule them considering not only computing resources but also data movement issues. Finally, we have data placement modules and traditional CPU schedulers to serve the upper layers and complete job execution or data transfer. Figure 3 represents a detailed view of the data-intensive system structure.

We can simply classify the steps in a large scale application as follows: (1) obtain data from experiments or run simulations to generate data; (2) transfer data and organize it for pre-processing; (3) perform data analysis and recognition; (4) move data for post-processing, visualization, or interactive analysis. We usually transfer data to separate centers for individual analysis, even though different science groups may require only some subsets of the data, such that data set groups from different activities may be used to extract features. A simple flow in the overall process is shown in Figure 4.

Figure 3: Data-aware System Model.

Execution sequences and data flows in scientific and commercial applications need to be parallel and should provide load balancing in order to handle complex data flows between heterogeneous distributed systems. There have been recent studies on workflow management (Ludscher et al., 2006; Taylor et al., 2005; Taylor et al., 2007), but providing input/output ordering for each component, sub-workflows, performance issues, and data-driven flow control are still open for research and development. Therefore, data placement is an effective factor in workflow management, and the scheduling of data placement tasks needs to be integrated with workflow managers.

Transfers, especially over wide-area networks, encounter various problems and should deal with obtaining sufficient bandwidth, handling network errors, and allocating space at both destination and source. There are similar approaches in different software to make resources usable to the science community (Foster and Kesselman, 1998; Thain et al., 2003; Buyya and Venugopal, 2004).

Replica catalogs, storage management, and data movement tools address some of these issues in the middleware technology (Ito et al., 2006a; Ito et al., 2006b; Madduri et al., 2002; Czajkowski et al., 2001). The importance of data scheduling was first emphasized by (Kosar, 2005), where data was stated to be a first class citizen in the overall structure.

Figure 4: Data Flow.

Traditional CPU-based scheduling systems do not meet the requirements of next generation science and business applications, in which data sources should be discovered and accessed in a flexible and efficient manner during job execution. Moreover, developments in the distributed computing infrastructure, such as fast connections between widely separated organizations and the deployment of huge computational resources all over the world, have made data scheduling techniques one of the most important issues. The data intensive characteristics of current applications have brought a new concept of data-aware scheduling to distributed scientific computation.

Some recent research focuses on simulating the Data Grid to investigate the behavior of various allocation approaches (Ranganathan and Foster, 2003; OptorSIM, 2006; Laure, 2004). Besides simple greedy scheduling techniques such as Least Frequently Used and Least Recently Used, there are also economic models handling data management and task allocation as a whole while making the scheduling decision (Venugopal et al., 2004).

One recent study concludes that allocating resources closest to the required data gives the best scheduling strategy (Ranganathan and Foster, 2002; Ranganathan and Foster, 2004). There is numerous work on replica management, high performance data transfer, and data storage organization; however, there is still a gap in data-aware scheduling satisfying the requirements of current e-Science applications.

5 METHODOLOGY AND DISCUSSION

Due to the nature of distributed environments, the underlying infrastructure needs to manage the dynamic behavior of heterogeneous systems, communication overhead, resource utilization, location transparency, and data migration. Data placement in a distributed infrastructure should use parallelism and concurrency for performance. Access and failure transparency are other major issues, such that user tasks should access resources in a standard way and complete execution without being affected by dynamic changes in the overall distributed system.

Data placement jobs have been categorized into different types (transfer, allocate, release, remove, locate, register, unregister), and it has been stated that each category has different importance and optimization characteristics (Kosar, 2005). In order to simplify the problem, we focus only on data movement strategies in data intensive applications. One important difficulty is managing storage space on data servers. Input data should be staged in, and generated output should be staged out after the completion of the job. Insufficient storage space will delay the execution, or the running task may crash due to improperly managed storage space on the server. Some storage servers enable users to allocate space before submitting the job, and they publish status information such as available storage space. However, there may be external access to the storage, preventing the data scheduler from making good predictions before starting the transfer. Allocating space and ordering transfers of data objects are basic tasks of a data scheduler. Different techniques such as smallest fit, best fit, and largest fit have been studied in (Kosar, 2006); a sketch of such policies follows this paragraph. Data servers have limited storage space and limited computing capacity. In order to obtain the best transfer throughput, the underlying layers at the server site that may affect data movement should be investigated cautiously. The performance of the file system, the network protocol, and concurrent and parallel transfers are some examples influencing data placement efficiency.
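A sketch of two such placement policies under our simplified reading (each server is described only by its free space; values are hypothetical):

```python
def best_fit(servers, size):
    """Best fit: the server whose free space exceeds `size` by the least,
    leaving large gaps intact for later, bigger requests."""
    candidates = [s for s in servers if s["free"] >= size]
    return min(candidates, key=lambda s: s["free"] - size) if candidates else None

def largest_fit(servers, size):
    """Largest fit: always place into the roomiest server."""
    roomiest = max(servers, key=lambda s: s["free"])
    return roomiest if roomiest["free"] >= size else None

servers = [{"name": "s1", "free": 5e9}, {"name": "s2", "free": 2e9}]
print(best_fit(servers, 1.5e9)["name"])      # -> s2 (tightest fit)
print(largest_fit(servers, 1.5e9)["name"])   # -> s1 (most free space)
```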

Another important feature of a data placement system is fault tolerant transfer of data objects. A storage server may crash due to too many concurrent write requests; data may be corrupted because of faulty hardware; a transfer can hang without any acknowledgment. A data transfer model should minimize the possibility of failures while scheduling, and it should also handle faulty transfers in a transparent way.
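A minimal sketch of such transparent failure handling, assuming a checksum is known in advance; a production system would also need timeouts to catch hung transfers, which urlretrieve alone does not provide.

```python
import hashlib
import time
import urllib.request

def reliable_fetch(url, dest, expected_sha256, retries=5):
    """Retry with exponential backoff; verify integrity on every attempt."""
    delay = 1.0
    for attempt in range(1, retries + 1):
        try:
            urllib.request.urlretrieve(url, dest)
            with open(dest, "rb") as f:
                digest = hashlib.sha256(f.read()).hexdigest()
            if digest == expected_sha256:
                return dest
            raise IOError("checksum mismatch: corrupted transfer")
        except Exception:
            if attempt == retries:
                raise                 # re-raise after the last attempt
            time.sleep(delay)         # back off before retrying
            delay *= 2
```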

First, we need to define the attributes that play a part in data placement and investigate their effects on the transfer operation in terms of efficiency and performance. Moreover, every parameter can be categorized according to the level at which it can be used, such that some attributes need to be set automatically by the middleware and some can be defined by users. There are also some ongoing projects preparing APIs for applications in widely distributed systems (Allen et al., 2002; Hirmer et al., 2006). Therefore, a detailed study to describe possible use cases and to define factors in data placement is mandatory in order to propose a data scheduling framework.

Another issue is investigating the effects of each factor under different conditions. Since the final goal is to propose a scheduling methodology, we should have a model arranging factors according to their significance, so that we can make decisions on ordering and initiating data placement jobs using measured or predicted metrics. We may need to prepare testbeds for experimenting with different scenarios in order to determine the impact of various attributes on the overall process. A simple test scenario is shown in Figure 5.

Figure 5: Experiments on Data Placement.

In order to prepare a testbed in which different ordering approaches can be implemented and new parameters affecting data transfer can be explored, we start with a simple configuration using the following attributes: network optimization (transfer rate), parallel streams, concurrent transfers, available/used disk space, and the data transfer protocol used.

The research on data scheduling is expected to result in a scheduling model based on ordering and grouping data placement tasks, one that can easily be implemented and deployed in current architectures. Besides, we will integrate our methodology with other middleware tools.

In conclusion, we will first concentrate on data placement scenarios in widely distributed systems. With the help of experiments and measurements in real life situations, we will investigate the crucial factors influencing data transfer operations. We are planning to develop a strategy to schedule data transfers in distributed environments, and studying the characteristics of data placement jobs will let us provide a model for scheduling.

The data scheduler has to be efficient, and it has to be easily adapted to different architectures. Therefore, greedy optimization algorithms are planned to be considered instead of complex techniques. Even simple observations, like resource allocation and checking for available disk space, will improve the overall effectiveness. Thus, we first concentrate on how to improve the decision process by exploring metrics that have been ignored until now. Another concern is integrating the data scheduler with job management systems and also with other middleware tools. During the scheduling process, we should consider both data placement and job execution requirements, so the data scheduler should also support a standard job management protocol to communicate with other services in distributed systems.

ACKNOWLEDGEMENTS

This work was supported by NSF grant CNS-0619843, Louisiana BoR-RCS grant LEQSF (2006-09)-RD-A-06, and the CCT General Development Program.

REFERENCES

Aldering, G. and Collaboration, S. (2002). Overview of the SuperNova/Acceleration Probe (SNAP). Retrieved February, 2007, from http://www.citebase.org/abstract?id=oai:arXiv.org:astro-ph/0209550.

Allcock, B., Bester, J., Bresnahan, J., Chervenak, A., Foster, I., Kesselman, C., Meder, S., Nefedova, V., Quesnel, D., and Tuecke, S. (2001a). Secure, efficient data transport and replica management for high-performance data-intensive computing. In IEEE Mass Storage Conference, San Diego, CA.

Allcock, B., Foster, I., Nefedova, V., Chervenak, A., Deelman, E., Kesselman, C., Lee, J., Sim, A., Shoshani, A., Drach, B., and Williams, D. (2001b). High-performance remote access to climate simulation data: A challenge problem for data grid technologies. In Supercomputing '01: Proceedings of the 2001 ACM/IEEE Conference on Supercomputing (CDROM), pages 46–46, New York, NY, USA. ACM Press.

Allcock, W., Bester, J., Bresnahan, J., Chervenak, A., Foster, I., Kesselman, C., Meder, S., Nefedova, V., Quesnel, D., and Tuecke, S. (2001c). Data management and transfer in high-performance computational grid environments. Parallel Computing, 2001. Retrieved February, 2007, from http://citeseer.ist.psu.edu/article/allcock01data.html.

Allen, G., Davis, K., Dramlitsch, T., Goodale, T., Kelley, I., Lanfermann, G., Novotny, J., Radke, T., Rasul, K., Russell, M., Seidel, E., and Wehrens, O. (2002). The GridLab grid application toolkit. In HPDC, page 411.

Allen, M. and Wolski, R. (2003). The Livny and Plank-Beck problems: Studies in data movement on the computational grid. In Supercomputing 2003.

Allman, M., Dawkins, S., Glover, D., Griner, J., Tran, D., Henderson, T., Heidemann, J., and Semke, J. (February 2000). Ongoing TCP Research Related to Satellites. IETF RFC 2760. Retrieved February, 2007, from http://www.ietf.org/rfc/rfc2760.txt.

Altschul, S., Gish, W., Miller, W., Myers, E., and Lipman, D. (1990). Basic local alignment search tool. J Mol Biol, 215(3):403–10.

Atlas (2006). A Toroidal LHC ApparatuS Project (ATLAS). Retrieved February, 2007, from http://atlas.web.cern.ch/.

Baker, M. (2004). Ian Foster on recent changes in the grid community. Distributed Systems Online, IEEE, 5:4/1–10.

Buyya, R. and Venugopal, S. (2004). The Gridbus toolkit for service oriented grid and utility computing: An overview and status report. In 1st IEEE Int. Workshop on Grid Economics and Business Models (GECON 2004).

Cern (2006). The world's largest particle physics laboratory, European Organization for Nuclear Research. Retrieved February, 2007, from http://public.web.cern.ch.

cFlowd (2006). Traffic Flow Analysis Tool. Retrieved February, 2007, from http://www.caida.org/tools/measurement/cflowd/.

CMS (2006a). The Compact Muon Solenoid. Retrieved February, 2007, from http://cmsinfo.cern.ch/outreach/.

CMS (2006b). The US Compact Muon Solenoid Project. Retrieved February, 2007, from http://uscms.fnal.gov/.

Condor (2006). CONDOR: High Throughput Computing. Retrieved February, 2007, from http://www.cs.wisc.edu/condor/.

Czajkowski, K., Fitzgerald, S., Foster, I., and Kesselman, C. (2001). Grid Information Services for Distributed Resource Sharing. In Proc. of the 10th IEEE High Performance Distributed Computing, pages 181–184.

DOE (2006). Center for Enabling Distributed Petascale Science. A Department of Energy SciDAC Center for Enabling Technology. Retrieved February, 2007, from http://www.cedps.net/wiki/index.php/Main_Page.

Dunigan, T., Mathis, M., and Tierney, B. (2002). A TCP Tuning Daemon. In Proceedings of SC02: High Performance Networking and Computing Conference.

Dykstra, P. (November 2006). High Performance Data Transfer. SC2006 Tutorial M07. Retrieved February, 2007, from http://www.wcisd.hpc.mil/~phil/sc2006/M07-2_files/frame.htm.

Eggert, L. R., Heidemann, J., and Touch, J. (2000). Effects of Ensemble-TCP. CCR 30(1). Retrieved February, 2007, from http://www.acm.org/sigcomm/ccr/archive/2000/jan00/ccr-200001-eggert.ps.

FastTCP (2006). An Alternative Congestion Control Algorithm in TCP. Retrieved February, 2007, from http://netlab.caltech.edu/FAST/.

Floyd, S. (2000). Congestion control principles. IETF RFC 2914. Retrieved February, 2007, from http://www.ietf.org/rfc/rfc2914.txt.

Foster, I. and Kesselman, C. (1998). The Globus Project: A Status Report. In 7th IEEE Heterogeneous Computing Workshop (HCW 98), pages 4–18.

GridFTP (2006). Protocol Extensions to FTP for the Grid. Retrieved February, 2007, from http://www.globus.org/grid_software/data/gridftp.php.

Hey, T. and Trefethen, A. (2003). The Data Deluge: An e-Science perspective. Grid Computing: Making the Global Infrastructure a Reality, Chichester, UK: John Wiley & Sons, Ltd., pages 809–824.

Hirmer, S., Kaiser, H., Merzky, A., Hutanu, A., and Allen, G. (2006). Generic support for bulk operations in grid applications. In MCG '06: Proceedings of the 4th International Workshop on Middleware for Grid Computing, page 9, New York, NY, USA. ACM Press.

Holtman, K. (July 2001). CMS data grid system overview and requirements. CMS Note 2001/037, CERN, 99.

Ito, T., Ohsaki, H., and Imase, M. (2006a). GridFTP-APT: Automatic parallelism tuning mechanism for data transfer protocol GridFTP. ccGrid, 0:454–461.

Ito, T., Ohsaki, H., and Imase, M. (2006b). On Parameter Tuning of Data Transfer Protocol GridFTP for Wide-Area. Retrieved February, 2007, from http://citeseer.ist.psu.edu/733457.html.

Johnston, W. E., Gannon, D., Nitzberg, B., Tanner, L. A., Thigpen, B., and Woo, A. (2000). Computing and data grids for science and engineering. In Supercomputing '00: Proceedings of the 2000 ACM/IEEE Conference on Supercomputing (CDROM), page 52, Washington, DC, USA. IEEE Computer Society.

Kola, G., Kosar, T., Frey, J., Livny, M., Brunner, R., and Remijan, M. (2004a). DiSC: A system for distributed data intensive scientific computing. In Proc. of First Workshop on Real, Large Distributed Systems, San Francisco, CA, December 2004.

Kola, G., Kosar, T., and Livny, M. (2004b). Profiling grid data transfer protocols and servers. In Proceedings of 10th European Conference on Parallel Processing (Europar 2004), Pisa, Italy.

Kosar, T. (2005). Data placement in widely distributed systems. Ph.D. Thesis, University of Wisconsin-Madison.

Kosar, T. (June 2006). A new paradigm in data intensive computing: Stork and the data-aware schedulers. In Challenges of Large Applications in Distributed Environments (CLADE 2006) Workshop, HPDC 2006.

Kosar, T., Kola, G., and Livny, M. (2004). Data pipelines: enabling large scale multi-protocol data transfers. In MGC '04: Proceedings of the 2nd Workshop on Middleware for Grid Computing, pages 63–68, New York, NY, USA. ACM Press.

Kosar, T. and Livny, M. (2004). Stork: Making Data Placement a first class citizen in the grid. In Proceedings of the 24th Int. Conference on Distributed Computing Systems, Tokyo, Japan, March 2004.

Laure, E. (December 2004). The EU DataGrid setting the basis for production grids. Journal of Grid Computing, Springer, 2(4).

Ludscher, B., Altintas, I., Berkley, C., Higgins, D., Jaeger, E., Jones, M., Lee, E. A., Tao, J., and Zhao, Y. (2006). Scientific workflow management and the Kepler system: Research articles. Concurr. Comput.: Pract. Exper., 18(10):1039–1065.

Madduri, R. K., Hood, C. S., and Allcock, W. E. (2002). Reliable file transfer in grid environments. In LCN '02: Proceedings of the 27th Annual IEEE Conference on Local Computer Networks, pages 737–738, Washington, DC, USA. IEEE Computer Society.

NetFlow (2006). Cisco IOS NetFlow. Retrieved February, 2007, from http://www.cisco.com.

NWS (2006). NWS: Network Weather Service. Retrieved February, 2007, from http://nws.cs.ucsb.edu/ewiki/.

OptorSIM (2006). A Grid simulator. Retrieved February, 2007, from http://www.gridpp.ac.uk/demos/optorsimapplet/.

Plank, J. S., Atchley, S., Ding, Y., and Beck, M. (2003). Algorithms for high performance, wide-area distributed file downloads. Parallel Processing Letters, 13(2):207–224.

Ranganathan, K. and Foster, I. (2002). Decoupling computation and data scheduling in distributed data-intensive applications. In HPDC '02: Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing (HPDC-11), page 352, Washington, DC, USA. IEEE Computer Society.

Ranganathan, K. and Foster, I. (2004). Computation scheduling and data replication algorithms for data grids. Grid Resource Management: State of the Art and Future Trends, pages 359–373.

Ranganathan, K. and Foster, I. T. (2003). Simulation studies of computation and data scheduling algorithms for data grids. J. Grid Comput., 1(1):53–62.

Root (2006). Object Oriented Data Analysis Framework. European Organization for Nuclear Research. Retrieved February, 2007, from http://root.cern.ch.

sTCP (2006). Scalable TCP. Retrieved February, 2007, from http://www.deneholme.net/tom/scalable/.

Stork (2006). STORK: A Scheduler for Data Placement Activities in the Grid. Retrieved February, 2007, from http://www.cs.wisc.edu/condor/stork/.

Stream (2006). Stream Control Transmission Protocol. Retrieved February, 2007, from http://tools.ietf.org/html/rfc2960.

Taylor, I., Shields, M., Wang, I., and Harrison, A. (2005). Visual Grid Workflow in Triana. Journal of Grid Computing, 3(3-4):153–169.

Taylor, I., Shields, M., Wang, I., and Harrison, A. (2007). The Triana Workflow Environment: Architecture and Applications. In Taylor, I., Deelman, E., Gannon, D., and Shields, M., editors, Workflows for e-Science, pages 320–339, Secaucus, NJ, USA. Springer, New York.

Thain, D., Tannenbaum, T., and Livny, M. (2003). Condor and the Grid. In Grid Computing: Making the Global Infrastructure a Reality, ISBN 0-470-85319-0, pages 299–336. John Wiley.

Tierney, B. L., Lee, J., Crowley, B., Holding, M., Hylton, J., and Drake, Jr., F. L. (1999). A network-aware distributed storage cache for data-intensive environments. In Proceedings of the Eighth IEEE International Symposium on High Performance Distributed Computing, pages 185–193, Redondo Beach, CA. IEEE Computer Society Press.

Tyson, J. A. (2002). Large Synoptic Survey Telescope: Overview. In Tyson, J. A. and Wolff, S., editors, Survey and Other Telescope Technologies and Discoveries, Proceedings of the SPIE, Volume 4836, pages 10–20.

Venugopal, S., Buyya, R., and Winton, L. (2004). A grid service broker for scheduling distributed data-oriented applications on global grids. In MGC '04: Proceedings of the 2nd Workshop on Middleware for Grid Computing, pages 75–80, New York, NY, USA. ACM Press.
