Review Article

Big Data in Cloud Computing: A Resource Management Perspective

Saeed Ullah, M. Daud Awan, and M. Sikander Hayat Khiyal

Faculty of Computer Science, Preston University, Islamabad, Pakistan

Correspondence should be addressed to Saeed Ullah; [email protected]

Received 27 October 2017; Revised 11 February 2018; Accepted 25 March 2018; Published 9 May 2018

Academic Editor: Sungyong Park

Hindawi Scientific Programming, Volume 2018, Article ID 5418679, 17 pages, https://doi.org/10.1155/2018/5418679

Copyright © 2018 Saeed Ullah et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Modern-day advancement is increasingly digitizing our lives, which has led to a rapid growth of data. Such multidimensional datasets are precious due to the potential of unearthing new knowledge and developing decision-making insights from them. Analyzing this huge amount of data from multiple sources can help organizations plan for the future and anticipate changing market trends and customer requirements. While the Hadoop framework is a popular platform for processing large datasets, a number of other computing infrastructures are available for use in various application domains. The primary focus of the study is how to classify major big data resource management systems in the context of a cloud computing environment. We identify some key features which characterize big data frameworks as well as their associated challenges and issues. We use various evaluation metrics from different aspects to identify usage scenarios of these platforms. The study came up with some interesting findings which are in contradiction with the available literature on the Internet.

1. Introduction

We live in the information age, and an important measurement of present times is the amount of data that is generated anywhere around us. Data is becoming increasingly valuable. Enterprises are aiming at unlocking data's hidden potential and delivering competitive advantage [1]. Stratistics MRC projected that the data analytics and Hadoop market, which accounted for $8.48 billion in 2015, is expected to reach $99.31 billion by 2022 [2]. The global big data market is estimated to jump from $14.87 billion in 2013 to $46.34 billion in 2018 [3]. Gartner has predicted that data will grow by 800 percent over the next five years, and 80 percent of the data will be unstructured (e-mails, documents, audio, video, and social media content) while 20 percent will be structured (e-commerce transactions and contact information) [1].

Today's largest scientific institution, CERN, produces over 200 PB of data per year in the Large Hadron Collider project (as of 2017). The amount of data generated on the Internet has already exceeded 2.5 exabytes per day. Within one minute, 400 hours of video are uploaded on YouTube and 3.6 million Google searches are conducted worldwide; more than 656 million tweets are shared on Twitter and more than 6.5 million pictures are shared on Instagram each day. When a dataset becomes so large that its storage and processing become challenging due to the constraints of existing tools and resources, the dataset is referred to as big data [4, 5]. This is the first part of the journey towards delivering decision-making insights. But instead of focusing on people, this process utilizes a much more powerful and evolving technology, given the latest breakthroughs in this field, to quickly analyze huge streams of data from a variety of sources and to produce one single stream of useful knowledge [6].

Big data applications might be viewed as the advancement of parallel computing, but with the important exception of the scale. The scale is a necessity arising from the nature of the target issues: data dimensions largely exceed conventional storage units, the level of parallelism needed to perform computation within a strict deadline is high, and obtaining final results requires the aggregation of large numbers of partial results. The scale factor, in this case, does not only have the same effect that it has in classical parallel computing, but




it surges towards a dimension in which automated resource management and its exploitation are of significant value [7].

An important factor for success in big data analytical projects is the management of resources: these platforms use a substantial amount of virtualized hardware resources to optimize the tradeoff between costs and results. Managing such resources is definitely a challenge. Complexity is rooted in their architecture. The first level of complexity stems from the performance requirements of computing nodes: typical big data applications utilize massively parallel computing resources, storage subsystems, and networking infrastructure, because results are required within a certain time frame or they can lose their value over time. Heterogeneity is a technological need: evolvability, extensibility, and maintainability of the hardware layer imply that the system will be partially integrated, replaced, or extended by means of new parts, according to availability on the market and the evolution of technology [7]. Another important consideration of modern applications is the massive amount of data that needs to be processed. Such data usually originate from different sets of devices (e.g., public web, business applications, satellites, or sensors) and procedures (e.g., case studies, observational studies, or simulations). Therefore, it is imperative to develop computational architectures with even better performance to support current and future application needs. Historically, this need for computational resources was met by high-performance computing (HPC) environments such as computer clusters, supercomputers, and grids. In traditional owner-centric HPC environments, internal resources are handled by a single administrative domain [19]. Cluster computing is the leading architecture for this environment. In distributed HPC environments, such as grid computing, virtual organizations manage the provisioning of resources, both internal and external, to meet application needs [20]. However, the paradigm shift towards cloud computing has been widely discussed in more recent research [19, 21], targeting the execution of HPC workloads on cloud computing environments. Although organizations usually prefer to store their most sensitive data internally (on-premises), huge volumes of big data (owned by the enterprises or generated by third parties) may be stored externally; some of it may already be on a cloud. Retaining all data sources behind the firewall may result in a significant waste of resources. Analyzing the data where it resides, either internally or in a public cloud data center, makes more sense [1, 22].

Even if cloud computing has to be an enabler to the growth of big data applications, common cloud computing solutions are rather different from big data applications. Typically, cloud computing solutions offer fine-grained, loosely coupled applications, run to serve large numbers of users that operate independently from multiple locations, possibly on their own private, nonshared data, with a significant amount of interaction, rather than being mainly batch-oriented, and are generally fit to be relocated with highly dynamic resource needs. Despite such differences, cloud computing and big data architectures share a number of common requirements, such as automated (or autonomic) fine-grained resource management and scaling-related issues [7].

As cloud computing begins to mature, a large number of enterprises are building efficient and agile cloud environments, and cloud providers continue to expand service offerings [1]. Microsoft's cloud Hadoop offering includes Azure Marketplace, which runs Cloudera Enterprise, MapR, and Hortonworks Data Platform (HDP) in a virtual machine, and Azure Data Lake, which includes Azure HDInsight, Data Lake Analytics, and Data Lake Store as managed services. The platform offers rich productivity suites for database, data warehouse, cloud spreadsheet, collaboration, business intelligence, OLAP, and development tools, delivering a growing Hadoop stack to the Microsoft community. Amazon Web Services reigns among the leaders of cloud computing and big data solutions. Amazon EMR is available across 14 regions worldwide. AWS offers versions of Hadoop, Spark, Tez, and Presto that can work off data stored in Amazon S3 and Amazon Glacier. Cloud Dataproc is Google's managed Hadoop and Spark cluster, which can use fully managed cloud services such as Google BigQuery and Bigtable. IBM differentiates BigInsights with end-to-end advanced analytics. IBM BigInsights runs on top of IBM's SoftLayer cloud infrastructure and can be deployed on more than 30 global data centers. IBM is making significant investments in Spark, BigQuality, BigIntegrate, and IBM InfoSphere BigMatch, which run natively with YARN to handle the toughest Hadoop use cases [23].

In this paper, we give an overview of some of the most popular and widely used big data frameworks, in the context of a cloud computing environment, which are designed to cope with the above-mentioned resource management and scaling problems. The primary objective of the study is how to classify different big data resource management systems. We use various evaluation metrics for popular big data frameworks from different aspects. We also identify some key features which characterize big data frameworks as well as their associated challenges and issues. We restricted our study selection criteria to empirical studies from existing literature with reported evidence on performance evaluation of big data resource management frameworks. To the best of our knowledge, thus far there has been no empirically based performance evaluation report on major resource management frameworks. We investigated the validity of existing research by performing a confirmatory study. For this purpose, standard performance evaluation tests as well as custom load test cases were performed on a 10+1-node t2.2xlarge Amazon AWS cluster. For experimentation and benchmarking, we followed the same process as outlined in our earlier study [24].

The study came up with some interesting findings which are in contradiction with the available literature on the Internet. The novelty of the study includes the categorization of cloud-based big data resource management frameworks according to their key features, a comparative evaluation of the popular big data frameworks, and the best practices related to the use of big data frameworks in the cloud.

The inclusion and exclusion criteria for relevant research studies are as follows:

(i) We selected only those resource management frameworks for which we found empirical evidence of being offered by various cloud providers.


Figure 1: Classification of big data resource management frameworks (compute infrastructure: batch processing via Hadoop; streaming, split into micro-batch via Spark and continuous streaming, the latter checkpointing-based in Flink and Samza and acknowledgement-based in Storm).

(ii) Several vendors offer their proprietary solutions for big data analysis, which could be potential candidates for the comparative analysis conducted in this study. However, these frameworks were not selected, for two reasons. Firstly, most of these solutions are extensions of open-source solutions and hence exhibit identical performance results in most of the cases. Secondly, for empirical studies, researchers mostly prefer open-source solutions, as the documentation, usage scenarios, source code, and other relevant details are freely available. Hence, we selected open-source solutions for the performance evaluation.

(iii) We did not include frameworks which are now deprecated or discontinued, such as Apache S4, in favor of other resource management systems.

This paper is organized as follows. Section 2 reviews the popular resource management frameworks. The comparison of big data frameworks is presented in Section 3. Based on the comparative evaluation, we categorize these systems in Section 4. Related work is presented in Section 5, and finally, we present conclusions and possible future directions in Section 6.

2. Big Data Resource Management Frameworks

Big data is offering new emerging trends and opportunities to unearth operational insight towards data management. The most challenging issue for organizations is often that the amount of data is massive and needs to be processed at an optimal speed to synthesize relevant results. Analyzing such huge amounts of data from multiple sources can help organizations plan for the future and anticipate changing market trends and customer requirements. In many cases, big data is analyzed in batch mode. However, in many situations, we may need to react to the current state of data or analyze data that is in motion (data that is constantly coming in and needs to be processed immediately). Such applications require a continuous stream of often unstructured data to be processed. Therefore, data is continuously analyzed and cached in memory before it is stored on secondary storage devices. Processing streams of data works by filtering in-memory tables of data across a cluster of servers. Any delay in the data analysis can seriously impact customer satisfaction or may result in project failure [25].
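As a toy illustration of processing data in motion, the following sketch keeps a bounded in-memory window over an incoming stream and flags anomalous windows before anything touches secondary storage. The function name, window size, and threshold are illustrative, not drawn from any particular framework discussed here:

```python
from collections import deque

def stream_filter(events, window_size=3, threshold=50):
    """Keep a sliding in-memory window of recent readings and flag any
    window whose average exceeds a threshold, much as a stream job
    filters in-memory tables before results are persisted."""
    window = deque(maxlen=window_size)  # bounded in-memory state
    alerts = []
    for event in events:
        window.append(event)
        if len(window) == window_size:
            avg = sum(window) / window_size
            if avg > threshold:
                alerts.append((list(window), avg))
    return alerts

readings = [10, 40, 80, 90, 100, 20]
alerts = stream_filter(readings)
```

The point of the sketch is that state stays small and in memory regardless of how long the stream runs, which is what makes low-latency reaction possible.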

While the Hadoop framework is a popular platform for processing huge datasets in parallel batch mode using commodity computational resources, there are a number of other computing infrastructures that can be used in various application domains. The primary focus of this study is to investigate popular big data resource management frameworks which are commonly used in a cloud computing environment. Most of the popular big data tools available for the cloud computing platform, including the Hadoop ecosystem, are available under open-source licenses. One of the key appeals of Hadoop and other open-source solutions is their low total cost of ownership. While proprietary solutions have expensive license fees and may require more costly specialized hardware, these open-source solutions have no licensing fees and can run on industry-standard hardware [14]. Figure 1 demonstrates the classification of various styles of processing architectures of open-source big data resource management frameworks.

In the subsequent sections, we discuss various open-source big data resource management frameworks that are widely used in conjunction with cloud computing environments.

2.1. Hadoop. Hadoop [26] is a distributed programming and storage infrastructure based on the open-source implementation of the MapReduce model [27]. MapReduce is the first and current de facto programming environment for developing data-centric parallel applications for parsing and processing large datasets. MapReduce is inspired by the Map and Reduce primitives used in functional programming. In MapReduce programming, users only have to write the logic of the Mapper and Reducer, while the process of shuffling, partitioning, and sorting is automatically handled by the execution engine [14, 27, 28]. The data can either be saved in the Hadoop file system as unstructured data or in a database as structured data [14]. The Hadoop Distributed File System (HDFS) is responsible for breaking large data files into smaller pieces known as blocks. The blocks are placed on different data nodes, and it is the job of the NameNode to know which blocks on which data nodes make up the complete file. The NameNode also works as a traffic cop, handling all access to the files, including reads, writes, creates, deletes, and replication of data blocks on the data nodes. A pipeline is a link between multiple data nodes that exists to handle the transfer of data across the servers. A user application pushes a block to the first data node in the pipeline. The data node takes over and forwards the block to the next node in the pipeline; this continues until all the data, and all the data replicas, are saved to disk. Afterwards, the client repeats the process by writing the next block in the file [25].
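To make the Mapper/Reducer division of labor concrete, here is a minimal word-count sketch in the spirit of Hadoop Streaming. The `mapper` and `reducer` functions mirror what a user writes; the in-process sort-and-group in `run_job` is only a stand-in for the shuffle that the Hadoop execution engine performs across the cluster:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # User logic: emit (word, 1) for every word in an input record.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # User logic: sum all counts for one key. Hadoop guarantees that all
    # values for a key arrive at the same reducer after shuffle/sort.
    return (word, sum(counts))

def run_job(lines):
    # In-process stand-in for the shuffle: sort map output by key, then
    # group and reduce. On a real cluster this spans many nodes.
    map_output = [kv for line in lines for kv in mapper(line)]
    map_output.sort(key=itemgetter(0))
    return dict(
        reducer(word, (c for _, c in group))
        for word, group in groupby(map_output, key=itemgetter(0))
    )

result = run_job(["big data in cloud", "big data frameworks"])
```

Note that the user code never mentions partitioning or sorting; that separation is exactly what the framework automates at scale.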

The two major components of Hadoop MapReduce are job scheduling and tracking. The early versions of Hadoop supported a limited job and task tracking system. In particular, the earlier scheduler could not manage non-MapReduce tasks and was not capable of optimizing cluster utilization. A new capability was therefore aimed at addressing these shortcomings, offering more flexibility, scaling, efficiency, and performance. Because of these issues, Hadoop 2.0 was introduced. Alongside the earlier HDFS resource management and MapReduce model, it introduced a new resource management layer called Yet Another Resource Negotiator (YARN) that takes care of better resource utilization [25].

YARN is the core Hadoop service providing two major functionalities: global resource management (ResourceManager) and per-application management (ApplicationMaster). The ResourceManager is a master service which controls the NodeManager in each of the nodes of a Hadoop cluster. It includes a scheduler whose main task is to allocate system resources to specific running applications. All the required system information is tracked by a Resource Container, which monitors CPU, storage, network, and other important resource attributes necessary for executing applications in the cluster. The ResourceManager has a slave NodeManager service to monitor application usage statistics. Each deployed application is handled by a corresponding ApplicationMaster service. If more resources are required to support the running application, the ApplicationMaster requests the NodeManager, and the NodeManager negotiates with the ResourceManager (scheduler) for the additional capacity on behalf of the application [26].
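The request/grant cycle described above can be sketched as a toy model. The class names mirror YARN's roles, but the scheduling logic here is a deliberately simplified first-fit allocator, not YARN's actual capacity or fair scheduler:

```python
class NodeManager:
    """Toy node: tracks the spare vcores and memory it can still offer."""
    def __init__(self, name, vcores, memory_mb):
        self.name, self.vcores, self.memory_mb = name, vcores, memory_mb

class ResourceManager:
    """Grants containers on the first node with enough spare capacity."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, vcores, memory_mb):
        for node in self.nodes:
            if node.vcores >= vcores and node.memory_mb >= memory_mb:
                node.vcores -= vcores          # reserve the capacity
                node.memory_mb -= memory_mb
                return (node.name, vcores, memory_mb)  # container grant
        return None  # no capacity: the request must wait

# An ApplicationMaster asking for three containers of 2 vcores / 4 GB.
rm = ResourceManager([NodeManager("node1", 4, 8192),
                      NodeManager("node2", 4, 8192)])
grants = [rm.allocate(2, 4096) for _ in range(3)]
```

The third request spills over to the second node once the first is full, which is the essence of cluster-wide (rather than per-job) resource management that YARN introduced.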

2.2. Spark. Apache Spark [29], originally developed as Berkeley Spark, was proposed as an alternative to Hadoop. It can perform faster parallel computing operations by using in-memory primitives. A job can load data in either local memory or a cluster-wide shared memory and query it iteratively with much greater speed compared to disk-based systems such as Hadoop MapReduce [27]. Spark has been developed for two applications where keeping data in memory may significantly improve performance: iterative machine learning algorithms and interactive data mining. Spark is also intended to unify the current processing stack, where batch processing is performed using MapReduce, interactive queries are performed using HBase, and the processing of streams for real-time analytics is performed using other frameworks such as Twitter's Storm. Spark offers programmers a functional programming paradigm with data-centric programming interfaces built on top of a new data model called the Resilient Distributed Dataset (RDD), which is a collection of objects spread across a cluster, stored in memory or on disk [28]. Applications in Spark can load these RDDs into the memory of a cluster of nodes and let the Spark engine automatically manage the partitioning of the data and its locality during runtime. This versatile iterative model makes it possible to control the persistence and manage the partitioning of data. A stream of incoming data can be partitioned into a series of batches and processed as a sequence of small-batch jobs. The Spark framework allows this seamless combination of streaming and batch processing in a unified system. To provide rapid application development, Spark provides clean, concise APIs in Scala, Java, and Python. Spark can also be used interactively from the Scala and Python shells to rapidly query big datasets.

Spark is also the engine behind Shark, a complete Apache Hive-compatible data warehousing system that can run much faster than Hive. Spark also supports data access from Hadoop. Spark fits in seamlessly with the Hadoop 2.0 ecosystem (Figure 2) as an alternative to MapReduce, while using the same underlying infrastructure such as YARN and HDFS. Spark is also an integral part of the SMACK stack, providing the most popular cloud-native PaaS capabilities, such as IoT, predictive analytics, and real-time personalization, for big data. In SMACK, the Apache Mesos cluster manager (instead of YARN) is used for dynamic allocation of cluster resources, not only for running Hadoop applications but also for handling heterogeneous workloads.
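The RDD idea of lazily chained transformations over partitioned data can be illustrated with a small conceptual model. This is not Spark's API: the method names `map`, `filter`, and `collect` merely echo it, and "partitions" are plain Python lists rather than distributed blocks:

```python
class ToyRDD:
    """Conceptual stand-in for an RDD: partitioned data plus a chain of
    deferred transformations, evaluated only when collect() is called."""
    def __init__(self, partitions, ops=None):
        self.partitions = partitions      # list of lists ("partitions")
        self.ops = ops or []              # deferred transformation chain

    def map(self, fn):
        return ToyRDD(self.partitions, self.ops + [("map", fn)])

    def filter(self, pred):
        return ToyRDD(self.partitions, self.ops + [("filter", pred)])

    def collect(self):
        out = []
        for part in self.partitions:      # evaluated per partition, as
            rows = part                   # Spark tasks would do per node
            for kind, fn in self.ops:
                rows = ([fn(r) for r in rows] if kind == "map"
                        else [r for r in rows if fn(r)])
            out.extend(rows)
        return out

rdd = ToyRDD([[1, 2, 3], [4, 5, 6]])
squares_of_evens = rdd.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
result = squares_of_evens.collect()  # nothing executes until this call
```

Because each transformation only records work to be done, the engine is free to fuse operations and schedule them near the data, which is where Spark's in-memory speedups come from.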

The GraphX and MLlib libraries include state-of-the-art graph and machine learning algorithms that can be executed in real time. BlinkDB is a novel parallel, sampling-based approximate query engine for running interactive SQL queries that trades off query accuracy for response time, with results annotated by meaningful error bars. BlinkDB has been shown to run 200 times faster than Hive within an error rate of 2–10%. Moreover, Spark provides an interactive tool called Spark Shell which allows exploiting the Spark cluster in real time. Once interactive applications are created, they may subsequently be executed interactively in the cluster.


Figure 2: Hadoop 2.0 ecosystem: HDFS (distributed file system), YARN (cluster resource management), MapReduce (distributed data processing), TEZ (execution engine), HBase (column-oriented NoSQL database), Pig (data flow language), Hive (SQL queries), Lucene (search), Oozie (workflow scheduling), and Zookeeper (coordination), running over the JVM, OS, hypervisor, and hardware layers (source: [14]).

Figure 3: Spark architecture: a driver program (Spark Context) coordinates worker nodes through a cluster manager (Spark Standalone, YARN, or Mesos), with the MLlib, SparkSQL, Spark Streaming, and GraphX libraries on top and data sources such as HDFS and a DBMS below (source: [15]).

In Figure 3, we present the general Spark system architecture.

2.3. Flink. Apache Flink is an emerging competitor of Spark which offers functional programming interfaces quite similar to Spark's. It shares many programming primitives and transformations in the same way as Spark does for iterative development, predictive analysis, and graph stream processing. Flink was developed to fill the gap left by Spark, which uses minibatch stream processing instead of a pure streaming approach. Flink ensures high processing performance when dealing with complex big data structures such as graphs. Flink programs are regular applications written with a rich set of transformation operations (such as mapping, filtering, grouping, aggregating, and joining) applied to the input datasets. The Flink dataset uses a table-based model; therefore, application developers can use index numbers to specify a particular field of a dataset [27, 28].

Flink is able to achieve high throughput and low latency, thereby processing a bundle of data very quickly. Flink is designed to run on large-scale clusters with thousands of nodes, and in addition to a standalone cluster mode, Flink provides support for YARN. For a distributed environment, Flink chains operator subtasks together into tasks; each task is executed by one thread [16]. The Flink runtime consists of two types of processes. There is at least one JobManager (also called master), which coordinates the distributed execution: it schedules tasks, coordinates checkpoints, and coordinates recovery on failures. A high-availability setup may involve multiple JobManagers, one of which is always the leader while the others are standby. The TaskManagers (also called workers) execute the tasks (or, more specifically, the subtasks) of a dataflow, and buffer and exchange the data streams. There must always be at least one TaskManager. The JobManagers and TaskManagers can be started in various ways: directly on the machines as a standalone cluster, in containers, or managed by resource frameworks like YARN or Mesos. TaskManagers connect to JobManagers, announcing themselves as available, and are assigned work. Figure 4 demonstrates the main components of the Flink framework.
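The idea of chaining operator subtasks into a single task, so that one thread runs the fused pipeline without intermediate buffering, can be sketched as follows. This is conceptual only; `chain` is a hypothetical helper, not a Flink API:

```python
def chain(*operators):
    """Fuse a sequence of per-record operators into one callable, the way
    Flink chains compatible subtasks so a single thread runs them all."""
    def fused(record):
        for op in operators:
            record = op(record)
            if record is None:   # an operator may drop the record (filter)
                return None
        return record
    return fused

# Three chained subtasks: parse, filter, project.
task = chain(
    lambda line: line.split(","),             # map: parse a CSV record
    lambda f: f if int(f[1]) > 10 else None,  # filter on the second field
    lambda f: f[0],                           # map: keep only the key
)

stream = ["a,5", "b,20", "c,15"]
results = [r for r in map(task, stream) if r is not None]
```

Fusing operators this way avoids handing each intermediate record between threads, which is one reason chained pipelines achieve low latency.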

2.4. Storm. Storm [17] is a free, open-source, distributed stream processing computation framework. It takes several characteristics from the popular actor model and can be used with practically any kind of programming language for developing applications such as real-time streaming analytics, critical workflow systems, and data delivery services. The engine may process billions of tuples each day in a fault-tolerant way. It can be integrated with popular resource management frameworks such as YARN, Mesos, and Docker. An Apache Storm cluster is made up of two types of processing actors: spouts and bolts.

(i) Spout is connected to the external data source of a stream and is continuously emitting or collecting new data for further processing.

(ii) Bolt is a processing logic unit within a streaming processing topology; each bolt is responsible for a certain processing task such as transformation, filtering, aggregating, and partitioning.

Figure 4: Architectural components of Flink: a client builds and submits the program dataflow to the JobManager (master, e.g., the YARN application master), whose scheduler and checkpoint coordinator deploy tasks, trigger checkpoints, and track status, while TaskManagers (workers) execute tasks in task slots and exchange data streams with one another (source: [16]).

Storm defines workflows as directed acyclic graphs (DAGs), called topologies, with connected spouts and bolts as vertices. Edges in the graph define the links between the bolts and the data stream. Unlike batch jobs, which are executed only once, Storm jobs run forever until they are killed. There are two types of nodes in a Storm cluster: nimbus (master node) and supervisor (worker node). Nimbus, similar to the Hadoop JobTracker, is the core component of Apache Storm and is responsible for distributing load across the cluster, queuing and assigning tasks to different processing units, and monitoring execution status. Each worker node executes a process known as the supervisor, which may have one or more worker processes. The supervisor delegates the tasks to worker processes; a worker process then creates a subset of the topology to run the task. Apache Storm relies on an internal distributed messaging system called Netty for the communication between nimbus and supervisors. Zookeeper manages the communication between real-time job trackers (nimbus) and supervisors (Storm workers). Figure 5 outlines the high-level view of a Storm cluster.
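A topology's spout-to-bolt flow can be mimicked in a few lines. This is a conceptual sketch, not Storm's API; the generator-based "spout" and the "bolt" functions are illustrative stand-ins for the vertices of a topology DAG:

```python
def sentence_spout():
    # Spout: continuously emits tuples from an external source; here a
    # fixed list stands in for an unbounded stream.
    for sentence in ["storm processes tuples", "bolts transform tuples"]:
        yield sentence

def split_bolt(stream):
    # Bolt 1 (transformation): split each sentence into word tuples.
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    # Bolt 2 (aggregation): keep a running count per word.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wire the DAG: spout -> split bolt -> count bolt.
word_counts = count_bolt(split_bolt(sentence_spout()))
```

In a real topology each vertex runs as many parallel task instances spread across supervisors, and the edges carry grouped (e.g., field-partitioned) tuple streams rather than an in-process generator.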

2.5. Apache Samza. Apache Samza [18] is a distributed stream processing framework mainly written in Scala and Java. Overall, it has a relatively high throughput, as well as somewhat increased latency, when compared to Storm [8]. It uses Apache Kafka, which was originally developed at LinkedIn, for messaging and streaming, while Apache Hadoop YARN or Mesos is utilized as an execution platform for overall resource management. Samza relies on Kafka's semantics to define the way streams are handled. Its main objective is to collect and deliver massively large volumes of event data, in particular log data, with a low latency. A Kafka system's architecture is comparatively simple, as it only consists of a set of brokers, which are individual nodes that make up a Kafka cluster. Data streams are defined by topics: a topic is a stream of related information that consumers can subscribe to. Topics are divided into partitions that are distributed over the broker instances for retrieving the corresponding messages using a pull mechanism. The basic flow of job execution is presented in Figure 6.
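How a topic's messages are spread over partitions by key, so that each consumer task can pull one partition independently while same-key messages stay ordered, can be sketched as follows. This is conceptual: Kafka's default partitioner hashes the key bytes with murmur2, whereas this sketch uses `zlib.crc32` purely for a deterministic, reproducible mapping:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Deterministic key -> partition mapping (stand-in for Kafka's
    # murmur2-based default partitioner).
    return zlib.crc32(key.encode()) % num_partitions

def publish(messages, num_partitions):
    # Append each (key, value) message to its partition's log; each
    # consumer task would then pull from exactly one of these logs.
    partitions = [[] for _ in range(num_partitions)]
    for key, value in messages:
        partitions[partition_for(key, num_partitions)].append((key, value))
    return partitions

events = [("user1", "login"), ("user2", "click"), ("user1", "logout")]
logs = publish(events, num_partitions=2)
```

Because both "user1" events hash to the same partition, any single consumer of that partition sees them in publication order, which is the ordering guarantee Samza builds on.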

Tables 1 and 2 present a brief comparative analysis of these frameworks based on some common attributes. As shown in the tables, the MapReduce computation data flow follows a chain of stages with no loop. At each stage, the program

Scientific Programming 7

Table 1: Comparison of big data frameworks.

Attribute | Hadoop | Spark | Storm | Samza | Flink
Current stable version | 2.8.1 | 2.2.0 | 1.1.1 | 0.13.0 | 1.3.2
Batch processing | Yes | Yes | Yes | No | Yes
Computational model | MapReduce | Streaming (microbatches) | Streaming (microbatches) | Streaming | Supports continuous flow streaming, microbatch, and batch
Data flow | Chain of stages | Directed acyclic graph | Directed acyclic graphs (DAGs) with spouts and bolts | Streams (acyclic graph) | Controlled cyclic dependency graph through machine learning
Resource management | YARN | YARN, Mesos | HDFS (YARN), Mesos | YARN, Mesos, Zookeeper | YARN, Mesos
Language support | All major languages | Java, Scala, Python, and R | Any programming language | JVM languages | Java, Scala, Python, and R
Job management optimization | MapReduce approach | Catalyst extension | Storm-YARN, 3rd-party tools like Ganglia | Internal JobRunner | Internal optimizer
Interactive mode | None (3rd-party tools like Impala can be integrated) | Interactive shell | None | Limited API of Kafka streams | Scala shell
Machine learning libraries | Apache Mahout, H2O | SparkML and MLlib | Trident-ML, Apache SAMOA | Apache SAMOA | Flink-ML
Maximum reported nodes (scalability) | Yahoo Hadoop cluster with 42000 nodes | 8000 | 300 | LinkedIn with around a hundred node clusters | Alibaba customized Flink cluster with thousands of nodes


Table 2: Comparative analysis of big data resource frameworks (s = 5).

Attribute | Hadoop | Spark | Flink | Storm | Samza
Processing speed | ★★★ | ★★★★ | ★★★★★ | ★★★★ | ★★★★
Fault tolerance | ★★★★ | ★★ | ★★★★ | ★★★ | ★★★★
Scalability | ★★★★★ | ★★★★ | ★★★ | ★★★ | ★★★
Machine learning | ★★ | ★★★★★ | ★★★★ | ★★★ | ★★★★
Low latency | ★★ | ★★★ | ★★★ | ★★★★ | ★★★★
Security | ★★★★ | ★★★★★ | ★★★★ | ★★★★ | N
Dataset size | ★★★★★ | ★★★ | ★★★★ | ★★★ | ★★★

[Figure 5: Architecture of a Storm cluster (nimbus master node, Zookeeper framework, and supervisor worker nodes running worker processes, executors, and tasks). Source: [17].]

[Figure 6: Samza architecture (Kafka topics feeding Samza containers and tasks under a YARN NodeManager). Source: [18].]

proceeds with the output from the previous stage and generates an input for the next stage. Although machine learning algorithms are mostly designed in the form of cyclic data flows, Spark, Storm, and Samza represent them as directed acyclic graphs to optimize the execution plan. Flink supports a controlled cyclic dependency graph at runtime to represent machine learning algorithms in a very efficient way.

Hadoop and Storm do not provide any default interactive environment. Apache Spark has a command-line interactive shell to use the application features. Flink provides a Scala shell to configure standalone as well as cluster setups. Apache Hadoop is highly scalable, and it has been used in Yahoo production consisting of 42000 nodes in 20 YARN clusters. The largest known cluster size for Spark is 8000 computing nodes, while Storm has been tested on a maximum of 300-node clusters. An Apache Samza cluster with around a hundred nodes has been used in the LinkedIn data flow and application messaging system. Apache Flink has been customized for the Alibaba search engine with a deployment capacity of thousands of processing nodes.

3. Comparative Evaluation of Big Data Frameworks

Big data in cloud computing, a popular research trend, is posing significant influence on current enterprises, IT industries, and research communities. There are a number of disruptive and transformative big data technologies and solutions that are rapidly emanating and evolving in order to provide data-driven insight and innovation. Furthermore, modern cloud computing services are offering all kinds of big data analytic tools, technologies, and computing infrastructures to speed up the data analysis process at an affordable cost. Although many distributed resource management frameworks are available nowadays, the main issue is how to


select a suitable big data framework. The selection of one big data platform over the others will come down to the specific application requirements and constraints, which may involve several tradeoffs and application usage scenarios. However, we can identify some key factors that need to be fulfilled before deploying a big data application in the cloud. In this section, based on some empirical evidence from the available literature, we discuss the advantages and disadvantages of each resource management framework.

3.1. Processing Speed. Processing speed is an important performance measurement that may be used to evaluate the effectiveness of different resource management frameworks. It is a common metric for the maximum number of I/O operations to disk or memory, or the data transfer rate between the computational units of the cluster, over a specific amount of time. In the context of big data, the average processing speed, represented as m and calculated after n iteration runs, is the maximum amount of memory/disk intensive operations that can be performed over a time interval t_i:

\[ m = \frac{\sum_{i=1}^{n} m_i}{\sum_{i=1}^{n} t_i} \quad (1) \]
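Equation (1) is simply total operations over total elapsed time, aggregated across benchmark runs. A minimal sketch (the benchmark figures below are illustrative, not taken from any of the cited studies):

```python
# Average processing speed m from eq. (1): total memory/disk-intensive
# operations divided by total elapsed time over n benchmark iterations.

def avg_processing_speed(ops_per_run, secs_per_run):
    """m = (sum_i m_i) / (sum_i t_i): aggregate ops over aggregate time."""
    return sum(ops_per_run) / sum(secs_per_run)

# Three iterations of a benchmark: operations completed and seconds taken.
m = avg_processing_speed([1200, 1500, 1300], [4.0, 5.0, 4.0])  # ops/sec
```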

Veiga et al. [30] conducted a series of experiments on a multicore cluster setup to demonstrate performance results of Apache Hadoop, Spark, and Flink. Apache Spark and Flink proved to be much more efficient execution platforms than Hadoop when performing nonsort benchmarks. It was further noted that Spark showed better performance results for operations such as WordCount and K-Means (CPU-bound in nature), while Flink achieved better results in the PageRank algorithm (memory-bound in nature). Mavridis and Karatza [31] experimentally compared performance statistics of Apache Hadoop and Spark on the Okeanos IaaS cloud platform. For each set of experiments, necessary statistics related to execution time, working nodes, and the dataset size were recorded. Spark performance was found optimal as compared to Hadoop for most of the cases. Furthermore, Spark on the YARN platform showed suboptimal results as compared to the case when it was executed in standalone mode. Some similar results were also observed by Zaharia et al. [32] on a 100 GB dataset record. Vellaipandiyan and Raja [33] demonstrated performance evaluation and comparison of Hadoop and Spark frameworks on a resident's record dataset ranging from 100 GB to 900 GB in size. Spark's performance was relatively better when the dataset size was between small and medium size (100 GB–750 GB); afterwards, its performance declined as compared to Hadoop. The primary reason for the performance decline was evident, as the Spark cache size could not fit into the memory for the larger dataset. Taran et al. [34] quantified performance differences of Hadoop and Spark using a WordCount dataset ranging from 100 KB to 1 GB. It was observed that the Hadoop framework was five times faster than Spark when the evaluation was performed using a larger set of data sources. However, for the smaller tasks, Spark showed better performance results. The speed-up ratio decreased for both frameworks with the growth of the input dataset.

Gopalani and Arora [35] used the K-Means algorithm on some small, medium, and large location datasets to compare Hadoop and Spark frameworks. The study results showed that Spark performed up to three times better than MapReduce for most of the cases. Bertoni et al. [36] performed an experimental evaluation of Apache Flink and Storm using a large genomic dataset on the Amazon EC2 cloud. Apache Flink was superior to Storm when performing histogram and map operations, while Storm outperformed Flink when the genomic join application was deployed.

3.2. Fault Tolerance. Fault tolerance is the characteristic that enables a system to continue functioning in case of the failure of one or more components. High-performance computing applications involve hundreds of nodes that are interconnected to perform a specific task; the failure of a node should have zero or minimal effect on the overall computation. The tolerance of a system, represented as Tol_FT, to meet its requirements after a disruption is the ratio of the time to complete tasks without observing any fault events to the overall execution time where some fault events were detected and the system state was reverted back to a consistent state:

\[ \mathrm{Tol}_{\mathrm{FT}} = \frac{T_x}{T_x + \sigma^2} \quad (2) \]

where T_x is the estimated correct execution time, obtained from a program run that is presumed to be fault-free or by averaging the execution time from several application runs that produce a known correct output, and σ² represents the variance in a program's execution time due to the occurrence of fault events. For an HPC application that consists of a set of computationally intensive tasks Γ = {τ_1, τ_2, …, τ_n}, and since Tol_FT for each individual task is computed as Tol_{τ_i}, the overall application resilience Tol may be calculated as [37]:

\[ \mathrm{Tol} = \sqrt[n]{\mathrm{Tol}_{\tau_1} \cdot \mathrm{Tol}_{\tau_2} \cdots \mathrm{Tol}_{\tau_n}} \quad (3) \]
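Equations (2) and (3) combine per-task tolerances into a geometric mean, so a single fragile task drags down the whole application's resilience. A small sketch with illustrative runtimes and variances:

```python
import math

def tol_ft(t_x, sigma_sq):
    """Per-task tolerance from eq. (2): Tol_FT = T_x / (T_x + sigma^2)."""
    return t_x / (t_x + sigma_sq)

def overall_resilience(task_tolerances):
    """Overall application resilience from eq. (3): the geometric mean
    of the per-task tolerances Tol_tau_i."""
    n = len(task_tolerances)
    return math.prod(task_tolerances) ** (1.0 / n)

# Three tasks: fault-free runtime estimates T_x (seconds) and runtime
# variance sigma^2 attributable to fault events (illustrative numbers).
tols = [tol_ft(120.0, 6.0), tol_ft(80.0, 2.0), tol_ft(200.0, 0.0)]
resilience = overall_resilience(tols)
```

A task with zero fault-induced variance has tolerance 1.0 and leaves the geometric mean unchanged.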

Lu et al. [38] used the StreamBench toolkit to evaluate the performance and fault tolerance ability of Apache Spark, Storm, and Samza. It was found that with increased data size, Spark is much more stable and fault-tolerant as compared to Storm, but may be less efficient when compared to Samza. Furthermore, when compared in terms of handling a large capacity of data, both Samza and Spark outperformed Storm. Gu and Li [39] used the PageRank algorithm to perform a comparative experiment on Hadoop and Spark frameworks. It was observed that for smaller datasets, such as wiki-Vote and soc-Slashdot0902, Spark outperformed Hadoop by a much better margin. However, this speed-up result degraded with the growth of the dataset, and for large datasets Hadoop easily outperformed Spark. Furthermore, for massively large datasets, Spark was reported to crash with a JVM heap exception, while Hadoop still performed its task. Lopez et al. [40] evaluated the throughput and fault tolerance mechanisms of Apache Storm and Flink. The experiments were based on a threat detection system, where Apache Storm demonstrated better throughput as compared to Flink. For fault tolerance,


different virtual machines were manually turned off to analyze the impact of node failures. Apache Flink used its internal subsystem to detect and migrate the failed tasks to other machines and hence resulted in very few message losses. On the other hand, Storm took more time, as Zookeeper, involving some performance overhead, was responsible for reporting the state of nimbus and thereafter processing the failed task on other nodes.

3.3. Scalability. Scalability refers to the ability to accommodate large loads or changes in size/workload by provisioning of resources at runtime. This can be further categorized as scale-up (by making hardware stronger) or scale-out (by adding additional nodes). One of the critical requirements of enterprises is to process large volumes of data in a timely manner to address high-value business problems. Dynamic resource scalability allows business entities to perform massive computation in parallel, thus reducing overall time complexity and effort. The definition of scalability comes from Amdahl's and Gustafson's laws [41]. Let W be the size of the workload before the improvement of the system resources, let the fraction of the execution workload that benefits from the improvement of system resources be α, and let the fraction concerning the part that would not benefit from improvement in the resources be 1 − α. When using an n-processor system, the user workload is scaled to:

\[ W' = \alpha W + (1 - \alpha)\, n W \quad (4) \]

The parallel execution time of a scaled workload on n processors defines the scaled-workload speed-up S, as shown in:

\[ S = \frac{W'}{W} = \frac{\alpha W + (1 - \alpha)\, n W}{W} \quad (5) \]
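Dividing through by W in equation (5) gives the familiar Gustafson form S = α + (1 − α)n, which the following sketch computes (the serial fraction and node count are illustrative):

```python
def scaled_speedup(alpha, n):
    """Scaled-workload speed-up from eqs. (4)-(5):
    S = W'/W = (alpha*W + (1 - alpha)*n*W) / W = alpha + (1 - alpha)*n."""
    return alpha + (1.0 - alpha) * n

# A job whose serial (non-improvable) fraction is 10%, on 16 workers.
s = scaled_speedup(alpha=0.1, n=16)
```

Note that a fully serial workload (α = 1) yields S = 1 regardless of n, while a fully parallel one (α = 0) scales linearly.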

García-Gil et al. [42] performed a scalability and performance comparison of Apache Spark and Flink using a feature selection framework to assemble multiple information theoretic criteria into a single greedy algorithm. The ECBDL14 dataset was used to measure the scalability factor for the frameworks. It was observed that Spark's scalability performance was 4–10 times faster than Flink's. Jakovits and Srirama [43] analyzed four MapReduce-based frameworks, including Hadoop and Spark, for benchmarking Partitioning Around Medoids, Clustering Large Applications, and Conjugate Gradient linear system solver algorithms using MPI. All experiments were performed on 33 Amazon EC2 large instances. For all algorithms, Spark performed much better as compared to Hadoop in terms of both performance and scalability. Boden et al. [44] aimed to investigate the scalability with respect to both data size and dimensionality in order to demonstrate a fair and insightful benchmark that reflects the requirements of real-world machine learning applications. The benchmark comprised distributed optimization algorithms for supervised learning as well as algorithms for unsupervised learning. For supervised learning, they implemented machine learning algorithms by using the Breeze library, while for unsupervised learning they chose the popular k-Means clustering algorithm in order to assess the scalability

of the two resource management frameworks. The overall execution time of Flink was relatively low in resource-constrained settings with a limited number of nodes, while Spark had a clear edge once enough main memory was available due to the addition of new computing nodes.

3.4. Machine Learning and Iterative Tasks Support. Big data applications are inherently complex in nature and usually involve tasks and algorithms that are iterative in nature. These applications have a distinct cyclic nature, achieving the desired result by continually repeating a set of tasks until the error cannot be substantially reduced further.

Spangenberg et al. [45] used real-world datasets with four algorithms, that is, WordCount, K-Means, PageRank, and relational query, to benchmark Apache Flink and Storm. It was observed that Apache Storm performs better in batch mode as compared to Flink. However, with increasing complexity, Apache Flink had a performance advantage over Storm, and thus it was better suited for iterative data or graph processing. Shi et al. [46] focused on analyzing Apache MapReduce and Spark for batch and iterative jobs. For smaller datasets, Spark proved to be a better choice, but when experiments were performed on larger datasets, MapReduce turned out to be several times faster than Spark. For iterative operations such as K-Means, Spark turned out to be 1.5 times faster as compared to MapReduce in its first iteration, while Spark was more than 5 times faster in subsequent operations.

Kang and Lee [47] examined five resource management frameworks, including Apache Hadoop and Spark, with respect to performance overheads (disk input/output, network communication, scheduling, etc.) in supporting iterative computation. The PageRank algorithm was used to evaluate these performance issues. Since static data processing tends to be a more frequent operation than dynamic data processing, as it is used in every iteration of MapReduce, it may cause significant performance overhead in the case of MapReduce. On the other hand, Apache Spark uses a read-only cached version of objects (resilient distributed datasets) which can be reused in parallel operations, thus reducing the performance overhead during iterative computation. Lee et al. [48] evaluated five systems, including Hadoop and Spark, over various workloads to compare four iterative algorithms. The experimentation was performed on the Amazon EC2 cloud. Overall, Spark showed the best performance when iterative operations were performed in main memory. In contrast, the performance of Hadoop was significantly poorer as compared to the other resource management systems.
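The structural point above — static data (the link graph) is reused in every iteration while only the ranks change — can be seen in a toy PageRank loop. This is a stdlib-only sketch of the algorithm shape, not of Spark's RDD API; in Spark the `links` structure would be a cached RDD so it is not re-read from disk each iteration, which is exactly the overhead MapReduce incurs.

```python
# Toy PageRank-style iteration: `links` is static across iterations
# (the data Spark would cache), only `ranks` changes per iteration.

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

def pagerank(links, iterations=10, d=0.85):
    ranks = {page: 1.0 / len(links) for page in links}
    for _ in range(iterations):
        contribs = {page: 0.0 for page in links}
        for page, outs in links.items():       # static data, every pass
            share = ranks[page] / len(outs)
            for out in outs:
                contribs[out] += share
        ranks = {p: (1 - d) / len(links) + d * c
                 for p, c in contribs.items()}  # dynamic data
    return ranks

ranks = pagerank(links)
```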

3.5. Latency. Big data and low latency are strongly linked. Big data applications provide true value to businesses, but they are mostly time critical. If cloud computing is to be the successful platform for big data implementation, one of the key requirements will be the provisioning of high-speed networks to reduce communication latency. Furthermore, big data frameworks usually involve a centralized design where the scheduler assigns all tasks through a single node, which may significantly impact the latency when the size of data is huge.


Let T_elapsed be the elapsed time between the start and finish time of a program in a distributed architecture, T_i be the effective execution time, and λ_i be the sum of total idle units of the i-th processor from a set of N processors. Then the average latency, represented as λ(W, N) for the size of workload W, is defined as the average amount of overhead time needed for each processor to complete the task:

\[ \lambda(W, N) = \frac{\sum_{i=1}^{N} \left( T_{\mathrm{elapsed}} - T_i + \lambda_i \right)}{N} \quad (6) \]
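Equation (6) averages, over the N processors, the gap between a worker's effective compute time and the job's wall-clock finish, plus its idle units. A minimal sketch with illustrative timings:

```python
def avg_latency(t_elapsed, exec_times, idle_units):
    """Average per-processor overhead lambda(W, N) from eq. (6):
    sum_i (T_elapsed - T_i + lambda_i) / N."""
    n = len(exec_times)
    return sum(t_elapsed - t + idle
               for t, idle in zip(exec_times, idle_units)) / n

# A job finishing at t = 10.0 s on three workers: effective execution
# times T_i and accumulated idle units lambda_i (illustrative values).
lat = avg_latency(10.0, exec_times=[9.0, 8.0, 7.5],
                  idle_units=[0.2, 0.5, 1.0])
```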

Chintapalli et al. [49] conducted a detailed analysis of the Apache Storm, Flink, and Spark streaming engines for latency and throughput. The study results indicated that for high throughput, Flink and Storm have significantly lower latency as compared to Spark. However, Spark was able to handle higher throughput as compared to the other streaming engines. Lu et al. [38] proposed the StreamBench benchmark framework to evaluate modern distributed stream processing frameworks. The framework includes dataset selection, data generation methodologies, program set description, workload suites design, and metric proposition. Two real-world datasets, AOL Search Data and the CAIDA Anonymized Internet Traces Dataset, were used to assess the performance, scalability, and fault tolerance aspects of streaming frameworks. It was observed that Storm's latency in most cases was far less than Spark's, except in the case when the scales of the workload and dataset were massive in nature.

3.6. Security. Instead of the classical HPC environment, where information is stored in-house, many big data applications are now increasingly deployed on the cloud, where privacy-sensitive information may be accessed or recorded by different data users with ease. Although data privacy and security issues are not a new topic in the area of distributed computing, their importance is amplified by the wide adoption of cloud computing services for big data platforms. The dataset may be exposed to multiple users for different purposes, which may lead to security and privacy risks.

Let N be a list of security categories that may be provided for a security mechanism. For instance, a framework may use an encryption mechanism to provide data security and access control lists for authentication and authorization services. Let max(W_i) be the maximum weight that is assigned to the i-th security category from a list of N categories, and let W_i be the reputation score of a particular resource management framework. Then the framework security ranking score can be represented as:

\[ \mathrm{Sec}_{\mathrm{score}} = \frac{\sum_{i=1}^{N} W_i}{\sum_{i=1}^{N} \max(W_i)} \quad (7) \]
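Equation (7) normalizes earned reputation scores against the maximum attainable weight across the categories. A sketch with hypothetical categories and scores (none of these values come from the cited studies):

```python
def security_score(weights, max_weights):
    """Security ranking from eq. (7): reputation scores W_i earned across
    N security categories, over the maximum attainable weights max(W_i)."""
    return sum(weights) / sum(max_weights)

# Hypothetical categories: encryption, authentication, access control,
# each with a maximum weight of 1.0; W_i is a framework's earned score.
sec = security_score(weights=[0.8, 1.0, 0.5],
                     max_weights=[1.0, 1.0, 1.0])
```

A framework offering no mechanism in any category (all W_i = 0) scores 0, and a framework maxing out every category scores 1.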

Hadoop and Storm use the Kerberos authentication protocol for computing nodes to prove their identity [50]. Spark adopts a password-based shared secret configuration as well as Access Control Lists (ACLs) to control the authentication and authorization mechanisms. In Flink, stream brokers are responsible for providing the authentication mechanism across multiple services. Apache Samza/Kafka provides no built-in security at the system level.

3.7. Dataset Size Support. Many scientific applications scale up to hundreds of nodes to process massive amounts of data that may exceed hundreds of terabytes. Unless big data applications are properly optimized for larger datasets, this may result in performance degradation with the growth of the data. Furthermore, in many cases this may result in a crash of software, resulting in loss of time and money. We used the same methodology to collect big data support statistics as presented in the earlier sections.

4. Discussion on Big Data Frameworks

Every big data framework has evolved from its unique design characteristics and application-specific requirements. Based on the key factors and their empirical evidence for evaluating big data frameworks, as discussed in Section 3, we produce the summary of results of the seven evaluation factors in the form of a star ranking, Ranking_RF ∈ [0, s], where s represents the maximum evaluation score. The general equation to produce the ranking for each resource framework (RF) is given as:

\[ \mathrm{Ranking}_{\mathrm{RF}} = \left\lVert \frac{s}{\sum_{i=1}^{7} \sum_{j=1}^{N} \max(W_{ij})} \ast \sum_{i=1}^{7} \sum_{j=1}^{N} W_{ij} \right\rVert \quad (8) \]

where N is the total number of research studies for a particular set of evaluation metrics, max(W_ij) ∈ [0, 1] is the relative normalized weight assigned to each literature study based on the number of experiments performed, and W_ij is the framework test-bed score calculated from the experimentation results of each study.
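A small sketch of how equation (8) turns per-study scores into the whole-star ratings of Table 2. The two-factor, two-study weights below are hypothetical placeholders, not the weights actually used in the study:

```python
def framework_ranking(w, w_max, s=5):
    """Star ranking from eq. (8): test-bed scores W_ij, summed over the
    evaluation factors i and the N studies j, normalized by the maximum
    attainable weight, scaled to [0, s], and rounded to whole stars."""
    total = sum(sum(per_factor) for per_factor in w)
    max_total = sum(sum(per_factor) for per_factor in w_max)
    return round(s * total / max_total)

# Hypothetical framework: two evaluation factors, two studies each,
# every study carrying a normalized maximum weight of 1.0.
stars = framework_ranking(w=[[0.9, 0.7], [0.6, 0.8]],
                          w_max=[[1.0, 1.0], [1.0, 1.0]])
```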

Hadoop MapReduce has a clear edge on large-scale deployment and larger dataset processing. Hadoop is highly compatible and interoperable with other frameworks. It also offers a reliable fault tolerance mechanism to provide failure-free operation over a long period of time. Hadoop can operate on a low-cost configuration. However, Hadoop is not suitable for real-time applications. It has a significant disadvantage when latency, throughput, and iterative job support for machine learning are the key considerations of the application requirements.

Apache Spark is designed to be a replacement for the batch-oriented Hadoop ecosystem to run over static and real-time datasets. It is highly suitable for high-throughput streaming applications where latency is not a major issue. Spark is memory intensive, and all operations take place in memory. As a result, it may crash if enough memory is not available for further operations (before the release of Spark version 1.5, it was not capable of handling datasets larger than the size of RAM, and the problem of handling larger datasets still persists in the newer releases with different performance overheads). A few research efforts, such as Project Tungsten, are aimed at addressing the efficiency of memory and CPU for Spark applications. Spark also lacks its own storage system, so its integration with HDFS through YARN, or with Cassandra using Mesos, is an extra overhead for cluster configuration.

Apache Flink is a true streaming engine. Flink supports both batch and real-time operations over a common runtime to fulfill the requirements of the Lambda architecture. However,


it may also work in batch mode by stopping the streaming source. Like Spark, Flink performs all operations in memory, but in case of a memory hog, it may also use disk storage to avoid application failure. Flink has some major advantages over Hadoop and Spark by providing better support for iterative processing with high throughput while keeping latency low.

Apache Storm was designed to provide a scalable, fault-tolerant, real-time streaming engine for data analysis, as Hadoop did for batch processing. However, the empirical evidence suggests that Apache Storm has proved inefficient in meeting the scale-up/scale-down requirements for real-time big data applications. Furthermore, since it uses microbatch stream processing, it is not very efficient where continuous stream processing is a major concern, nor does it provide a mechanism for simple batch processing. For fault tolerance, Storm uses Zookeeper to store the state of the processes, which may involve some extra overhead and may also result in message loss. On the other hand, Storm is an ideal solution for near-real-time application processing, where the workload can be processed with a minimal delay under strict latency requirements.

Apache Samza, in integration with Kafka, provides some unique features that are not offered by other stream processing engines. Samza provides a powerful checkpointing-based fault tolerance mechanism with minimal data loss. Samza jobs can have high throughput with low latency when integrated with Kafka. However, Samza lacks some important features as a data processing engine. Furthermore, it offers no built-in security mechanism for data access control.

To categorize the selection of the best resource engine based on a particular set of requirements, we use the framework proposed by Chung et al. [51]. The framework provides a layout for matching, ranking, and selecting a system based on some particular requirements. The matching criterion is based on user goals, which are categorized as soft and hard goals. Soft goals represent the nonfunctional requirements of the system (such as security, fault tolerance, and scalability), while hard goals represent the functional aspects of the system (such as machine learning and data size support). The relationship between the system components and the goals can be ranked as very positive (++), positive (+), negative (−), and very negative (−−). Based on the evidence provided from the literature, the categorization of the major resource management frameworks is presented in Figure 7.
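The goal-matching idea can be sketched as a small scoring routine: each framework carries a ++/+/−/−− contribution per goal, and candidates are ranked against the goals a user requires. The profiles below are illustrative placeholders, not the ratings of Figure 7.

```python
# Sketch of soft/hard-goal matching in the style of Chung et al. [51].
# ASCII '-'/'--' stand in for the paper's minus symbols.

CONTRIB = {"++": 2, "+": 1, "-": -1, "--": -2}

profiles = {  # illustrative, per-goal contribution labels
    "hadoop": {"scalability": "++", "fault_tolerance": "++", "latency": "--"},
    "spark":  {"scalability": "+",  "fault_tolerance": "+",  "latency": "+"},
    "storm":  {"scalability": "-",  "fault_tolerance": "+",  "latency": "++"},
}

def select(requirements, profiles):
    """Return the framework with the highest summed contribution over
    the user's required goals (unlisted goals default to negative)."""
    def score(name):
        return sum(CONTRIB[profiles[name].get(goal, "-")]
                   for goal in requirements)
    return max(profiles, key=score)

best = select(["latency", "fault_tolerance"], profiles)
```

For a latency-dominated requirement set, the routine picks the streaming-oriented engine, mirroring the qualitative conclusion of this section.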

5. Related Work

Our research work differs from other efforts because the subject, goal, and object of study are not identical, as we provide an in-depth comparison of popular resource engines based on empirical evidence from the existing literature. Hesse and Lorenz [8] conducted a conceptual survey on stream processing systems. However, their discussion was focused on some basic differences related to real-time data processing engines. Singh and Reddy [9] provided a thorough analysis of big data analytic platforms that included peer-to-peer networks, field programmable gate arrays (FPGA), the Apache Hadoop ecosystem, high-performance computing (HPC)

clusters, multicore CPUs, and graphics processing units (GPUs). Our case is different here, as we are particularly interested in big data processing engines. Finally, Landset et al. [10] focused on machine learning libraries and their evaluation based on ease of use, scalability, and extensibility. However, our work differs from their work, as the primary focuses of the two studies are not identical.

Chen and Zhang [11] discussed big data problems and challenges and the associated techniques and technologies to address these issues. Several potential techniques, including cloud computing, quantum computing, granular computing, and biological computing, were investigated, and the possible opportunities to explore these domains were demonstrated. However, the performance evaluation was discussed only on theoretical grounds. A taxonomy and detailed analysis of the state of the art in big data 2.0 processing systems were presented in [12]. The focus of the study was to identify current research challenges and highlight opportunities for new innovations and optimization for future research and development. Assunção et al. [13] reviewed multiple generations of data stream processing frameworks that provide mechanisms for resource elasticity to match the demands of stream processing services. The study examined the challenges associated with efficient resource management decisions and suggested solutions derived from the existing research studies. However, the study metrics are restricted to the elasticity/scalability aspect of big data streaming frameworks.

As shown in Table 3, our work differs from the previous studies, which focused on the classification of resource management frameworks on theoretical grounds. In contrast to the earlier approaches, we classify and categorize big data resource management frameworks on empirical grounds, derived from multiple evaluation/experimentation studies. Furthermore, our evaluation/ranking methodology is based on a comprehensive list of study variables which were not addressed in the studies conducted earlier.

6. Conclusions and Future Work

There are a number of disruptive and transformative big data technologies and solutions that are rapidly emanating and evolving in order to provide data-driven insight and innovation. The primary object of the study was how to classify popular big data resource management systems. This study was also aimed at addressing the selection of a candidate resource provider based on specific big data application requirements. We surveyed different big data resource management frameworks and investigated the advantages and disadvantages of each of them. We carried out the performance evaluation of the resource management engines based on seven key factors, and each of the frameworks was ranked based on the empirical evidence from the literature.

6.1. Observations and Findings. Some key findings of the study are as follows:

(i) In terms of processing speed, Apache Flink outperforms the other resource management frameworks for small, medium, and large datasets [30, 36]. However, during our own set of experiments on an Amazon EC2


Table 3: Comparison and application areas of related research studies.

Study reference | Data model | Resource frameworks | Study features | Evaluation/ranking methodology
[8] | Data stream processing systems | Storm, Flink, Spark, Samza | A brief comparison of resource frameworks | N
[9] | Batch and stream processing systems | Horizontal scaling systems such as peer-to-peer, MapReduce, MPI, and Spark, and vertical scaling systems such as CUDA and HDL | Comparison of horizontal and vertical scaling systems | Theoretical comparison of resource frameworks
[10] | Batch and stream processing engines | MapReduce, Spark, Flink, and Storm as well as machine learning libraries | Machine learning libraries and their evaluation mechanism | Performance comparison with respect to machine learning toolkits
[11] | Batch and stream processing frameworks | Hadoop, Storm, and other big data frameworks | In-depth analysis of big data opportunities and challenges | N
[12] | Batch and stream processing frameworks | Hadoop, Spark, Storm, Flink, and Tez as well as SQL, Graph, and the bulk synchronous parallel model | Analysis of current open research challenges in the field of big data and the promising directions for future research | N
[13] | Stream processing engines | Apache Storm, S4, Flink, Samza, Spark Streaming, and Twitter Heron | Classification of elasticity metrics for resource allocation strategies that meet the demands of stream processing services | Evaluation of elasticity scaling metrics for stream processing systems


[Figure 7: Categorization of resource engines (Hadoop, Spark, Flink, Storm, Samza) based on key big data application requirements, decomposed into soft (nonfunctional) goals such as fault tolerance, scalability, performance, security, and latency, and hard (functional) goals such as machine learning and data size support.]

cluster with varied task manager settings (1–4 task managers per node), Flink failed to complete custom smaller-size JVM dataset jobs due to the inefficient memory management of the Flink memory manager. We could not find any reported evidence of this particular use case in the relevant literature studies. It seems that most of the performance evaluation studies employed standard benchmarking test sets where the dataset size was relatively large, and hence this particular use case was not reported in these studies. Further research effort is required to elucidate the underlying factors specific to this particular case.

(ii) Big data applications usually involve a massive amount of data. Apache Spark supports the necessary strategies for fault tolerance, but it has been reported to crash on larger datasets. Even during our experimentation, Apache Spark version 1.6 (selected for compatibility with earlier research) crashed on several occasions when the dataset was larger than 500 GB. Although Spark has been ranked higher in terms of fault tolerance with the increase of data scale in several studies [38, 52], it has limitations in handling larger, typical big data applications, and hence such studies cannot be generalized.

(iii) Spark MLlib and Flink-ML offer a variety of machine learning algorithms and utilities for distributed and scalable big data applications. Spark MLlib outperforms Flink-ML in most machine learning use cases [42], except when repeated passes are performed on unchanged data. However, this performance evaluation may require further investigation,


as research studies such as [53] reported differently, with Flink outperforming Spark on sufficiently large cases.

(iv) For graph processing algorithms such as PageRank, Flink uses the Gelly library, which provides native closed-loop iteration operators, making it a suitable platform for large-scale graph analytics. Spark, on the other hand, uses the GraphX library, which takes much longer to preprocess and build the graph and other data structures, making its performance worse compared to Apache Flink. Apache Flink has been reported to obtain the best results for graph processing datasets (3x–5x in [54] and 2x–3x in [45]) compared to Spark. However, some studies, such as [55], reported Spark to be 1.7x faster than Flink for large graph processing. Such inconsistent behavior may be investigated further in future research studies.
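As a concrete illustration of why native iteration support matters here, PageRank is an inherently iterative computation: the same pass over unchanged graph structure is repeated many times. The following plain-Python power iteration is a pedagogical sketch of the algorithm, not Gelly or GraphX code; the tiny three-page graph is invented for the example.

```python
# Power-iteration PageRank on an adjacency-list graph.
def pagerank(links, damping=0.85, iters=50):
    nodes = list(links)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        # Teleport term, then redistribute each node's rank over its out-links.
        new = {n: (1 - damping) / len(nodes) for n in nodes}
        for src, outs in links.items():
            for dst in outs:
                new[dst] += damping * rank[src] / len(outs)
        rank = new
    return rank

# Tiny 3-page web: A and B link to C; C links back to A.
graph = {"A": ["C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))   # C accumulates the most rank
```

Because every iteration touches the whole (unchanging) link structure, an engine with closed-loop iteration operators can keep the graph resident and avoid rebuilding it per pass, which is exactly the advantage attributed to Flink above.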

6.2. Future Work. In the area of big data, there is still a clear gap that requires more effort from the research community to build an in-depth understanding of the performance characteristics of big data resource management frameworks. We consider this study a step towards enlarging our knowledge of the big data world and an effort towards improving the state of the art and achieving the big vision in the big data domain. In earlier studies, a clear ranking cannot be established, as the study parameters were mostly limited to a few issues such as throughput, latency, and machine learning. Furthermore, further investigation is required on resource engines such as Apache Samza in comparison with other frameworks. In addition, research effort needs to be carried out in several areas, such as data organization, platform-specific tools, and technological issues in the big data domain, in order to create next-generation big data infrastructures. The performance evaluation factors might also vary among systems depending on the algorithms used.

As future work, we plan to benchmark these popular resource engines for meeting the resource demands and requirements of different scientific applications. Moreover, a scalability analysis could be performed; in particular, the performance evaluation of dynamically adding worker nodes and the resulting performance analysis is of particular interest. Additionally, further research can be carried out to evaluate performance aspects with respect to resource competition between jobs (on different resource schedulers such as YARN and Mesos) and the fluctuation of available computing resources. Finally, most of the experimentation in earlier studies was performed using standard parameter configurations; however, each resource management framework offers domain-specific tweaks and configuration optimization mechanisms for meeting application-specific requirements. The development of a benchmark suite that aims to find maximum throughput based on configuration optimization would be an interesting direction for future research.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] "Big Data in the Cloud: Converging Technologies (August 11)," https://www.intel.com/content/dam/www/public/emea/de/de/documents/product-briefs/big-data-cloud-technologies-brief.pdf.

[2] Q. He, J. Yan, R. Kowalczyk, H. Jin, and Y. Yang, "Lifetime service level agreement management with autonomous agents for services provision," Information Sciences, vol. 179, no. 15, pp. 2591–2605, 2009.

[3] O. Rana, M. Warnier, T. B. Quillinan, and F. Brazier, "Monitoring and reputation mechanisms for service level agreements," in 5th International Workshop on Grid Economics and Business Models (GECON '08), vol. 5206 of Lecture Notes in Computer Science, pp. 125–139, 2008.

[4] N. Yuhanna and M. Gualtieri, "The Forrester Wave: Big Data Hadoop Cloud Solutions, Q2 2016: Elasticity, Automation, and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud (June 20)," https://ncmedia.azureedge.net/ncmedia/2016/05/The_Forrester_Wave_Big_D.pdf.

[5] R. Arora, "An introduction to big data, high performance computing, high-throughput computing, and Hadoop," in Conquering Big Data with High Performance Computing, pp. 1–12, 2016.

[6] F. Pop, J. Kołodziej, and B. D. Martino, Resource Management for Big Data Platforms: Algorithms, Modelling, and High-Performance Computing Techniques, Springer, 2016.

[7] M. Gribaudo, M. Iacono, and F. Palmieri, "Performance modeling of big data-oriented architectures," in Resource Management for Big Data Platforms, Computer Communications and Networks, pp. 3–34, Springer International Publishing, Cham, 2016.

[8] G. Hesse and M. Lorenz, "Conceptual survey on data stream processing systems," in Proceedings of the 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), pp. 797–802, Melbourne, Australia, December 2015.

[9] D. Singh and C. K. Reddy, "A survey on platforms for big data analytics," Journal of Big Data, vol. 2, Article ID 8, 2014.

[10] S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin, "A survey of open source tools for machine learning with big data in the Hadoop ecosystem," Journal of Big Data, vol. 2, no. 1, Article ID 24, 2015.

[11] C. L. P. Chen and C. Y. Zhang, "Data-intensive applications, challenges, techniques and technologies: a survey on Big Data," Information Sciences, vol. 275, pp. 314–347, 2014.

[12] F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, and S. Sakr, "Big data 2.0 processing systems: taxonomy and open challenges," Journal of Grid Computing, vol. 14, no. 3, pp. 379–405, 2016.

[13] M. D. d. Assuncao, A. d. S. Veith, and R. Buyya, "Distributed data stream processing and edge computing: a survey on resource elasticity and future directions," Journal of Network and Computer Applications, vol. 103, pp. 1–17, 2018.

[14] "Big data working group: big data taxonomy, 2014."

[15] W. Inoubli, S. Aridhi, H. Mezni, M. Maddouri, and E. M. Nguifo, "An experimental survey on big data frameworks," Future Generation Computer Systems, 2018.

[16] H.-K. Kwon and K.-K. Seo, "A fuzzy AHP based multi-criteria decision-making model to select a cloud service," International Journal of Smart Home, vol. 8, no. 3, pp. 175–180, 2014.

[17] A. Rezgui and S. Rezgui, "A stochastic approach for virtual machine placement in volunteer cloud federations," in Proceedings of the 2nd IEEE International Conference on Cloud Engineering (IC2E 2014), pp. 277–282, Boston, MA, USA, March 2014.

[18] M. Salama, A. Zeid, A. Shawish, and X. Jiang, "A novel QoS-based framework for cloud computing service provider selection," International Journal of Cloud Applications and Computing, vol. 4, no. 2, pp. 48–72, 2014.

[19] G. Mateescu, W. Gentzsch, and C. J. Ribbens, "Hybrid computing: where HPC meets grid and cloud computing," Future Generation Computer Systems, vol. 27, no. 5, pp. 440–453, 2011.

[20] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: enabling scalable virtual organizations," International Journal of High Performance Computing Applications, vol. 15, no. 3, pp. 200–222, 2001.

[21] S. Benedict, "Performance issues and performance analysis tools for HPC cloud applications: a survey," Computing, vol. 95, no. 2, pp. 89–108, 2013.

[22] E. C. Inacio and M. A. R. Dantas, "A survey into performance and energy efficiency in HPC, cloud and big data environments," International Journal of Networking and Virtual Organisations, vol. 14, no. 4, pp. 299–318, 2014.

[23] N. Yuhanna and M. Gualtieri, "Elasticity, Automation and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud, Q2 2016."

[24] S. Ullah, M. D. Awan, and M. S. Khiyal, "A price-performance analysis of EC2, Google Compute and Rackspace cloud providers for scientific computing," Journal of Mathematics and Computer Science, vol. 16, no. 2, pp. 178–192, 2016.

[25] J. Hurwitz, A. Nugent, F. Halper, and M. Kaufman, Big Data for Dummies, John Wiley & Sons, 2013.

[26] S. K. Garg, S. Versteeg, and R. Buyya, "A framework for ranking of cloud computing services," Future Generation Computer Systems, vol. 29, no. 4, pp. 1012–1023, 2013.

[27] D. Wu, S. Sakr, and L. Zhu, "Big data programming models," in Handbook of Big Data Technologies, pp. 31–63, 2017.

[28] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, vol. 1, no. 1, 2016.

[29] A. Li, X. Yang, S. Kandula, and M. Zhang, "CloudCmp: comparing public cloud providers," in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10), pp. 1–14, ACM, Melbourne, Australia, November 2010.

[30] J. Veiga, R. R. Exposito, X. C. Pardo, G. L. Taboada, and J. Touriño, "Performance evaluation of big data frameworks for large-scale data analytics," in Proceedings of the 4th IEEE International Conference on Big Data (Big Data 2016), pp. 424–431, December 2016.

[31] I. Mavridis and H. Karatza, "Log file analysis in cloud with Apache Hadoop and Apache Spark," in Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015), Poland, 2015.

[32] M. Zaharia, M. Chowdhury, and T. Das, "Fast and interactive analytics over Hadoop data with Spark," USENIX Login, vol. 37, no. 4, pp. 45–51, 2012.

[33] S. Vellaipandiyan and P. V. Raja, "Performance evaluation of distributed framework over YARN cluster manager," in Proceedings of the 2016 IEEE International Conference on Computational Intelligence and Computing Research (ICCIC 2016), December 2016.

[34] V. Taran, O. Alienin, S. Stirenko, Y. Gordienko, and A. Rojbi, "Performance evaluation of distributed computing environments with Hadoop and Spark frameworks," in Proceedings of the 2017 IEEE International Young Scientists' Forum on Applied Physics and Engineering (YSF), pp. 80–83, Lviv, Ukraine, October 2017.

[35] S. Gopalani and R. Arora, "Comparing Apache Spark and Map Reduce with performance analysis using K-means," International Journal of Computer Applications, vol. 113, no. 1, pp. 8–11, 2015.

[36] M. Bertoni, S. Ceri, A. Kaitoua, and P. Pinoli, "Evaluating cloud frameworks on genomic applications," in Proceedings of the 3rd IEEE International Conference on Big Data (IEEE Big Data 2015), pp. 193–202, November 2015.

[37] S. Hukerikar, R. A. Ashraf, and C. Engelmann, "Towards new metrics for high-performance computing resilience," in Proceedings of the 7th Fault Tolerance for HPC at eXtreme Scale Workshop (FTXS 2017), pp. 23–30, June 2017.

[38] R. Lu, G. Wu, B. Xie, and J. Hu, "Stream bench: towards benchmarking modern distributed stream computing frameworks," in Proceedings of the 7th IEEE/ACM International Conference on Utility and Cloud Computing (UCC 2014), pp. 69–78, December 2014.

[39] L. Gu and H. Li, "Memory or time: performance evaluation for iterative operation on Hadoop and Spark," in Proceedings of the 15th IEEE International Conference on High Performance Computing and Communications (HPCC 2013) and the 11th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing (EUC 2013), pp. 721–727, November 2013.

[40] M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte, "A performance comparison of open-source stream processing platforms," in Proceedings of the 59th IEEE Global Communications Conference (GLOBECOM 2016), December 2016.

[41] J. L. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, vol. 31, no. 5, pp. 532–533, 1988.

[42] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, "A comparison on scalability for batch big data processing on Apache Spark and Apache Flink," Big Data Analytics, vol. 2, no. 1, 2017.

[43] P. Jakovits and S. N. Srirama, "Evaluating MapReduce frameworks for iterative scientific computing applications," in Proceedings of the 2014 International Conference on High Performance Computing and Simulation (HPCS 2014), pp. 226–233, July 2014.

[44] C. Boden, A. Spina, T. Rabl, and V. Markl, "Benchmarking data flow systems for scalable machine learning," in Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond (BeyondMR 2017), May 2017.

[45] N. Spangenberg, M. Roth, and B. Franczyk, Evaluating New Approaches of Big Data Analytics Frameworks, vol. 208 of Lecture Notes in Business Information Processing, Springer, Cham, 2015.

[46] J. Shi, Y. Qiu, U. F. Minhas, et al., "Clash of the titans: MapReduce vs. Spark for large scale data analytics," in Proceedings of the 41st International Conference on Very Large Data Bases, pp. 2110–2121, Kohala Coast, HI, USA, 2015.

[47] M. Kang and J. Lee, "A comparative analysis of iterative MapReduce systems," in Proceedings of the Sixth International Conference, pp. 61–64, Jeju, Republic of Korea, October 2016.

[48] H. Lee, M. Kang, S.-B. Youn, J.-G. Lee, and Y. Kwon, "An experimental comparison of iterative MapReduce frameworks," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM 2016), pp. 2089–2094, October 2016.

[49] S. Chintapalli, D. Dagit, B. Evans, et al., "Benchmarking streaming computation engines: Storm, Flink and Spark streaming," in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW 2016), pp. 1789–1792, May 2016.

[50] X. Zhang, C. Liu, S. Nepal, W. Dou, and J. Chen, "Privacy-preserving layer over MapReduce on cloud," in Proceedings of the 2nd International Conference on Cloud and Green Computing (CGC 2012), held jointly with the 2nd International Conference on Social Computing and Its Applications (SCA 2012), pp. 304–310, China, November 2012.

[51] L. Chung, B. A. Nixon, and E. Yu, "Using nonfunctional requirements to systematically select among alternatives in architectural design," in Proceedings of the 1st International Workshop on Architectures for Software Systems, pp. 31–43, 1995.

[52] S. Qian, G. Wu, J. Huang, and T. Das, "Benchmarking modern distributed streaming platforms," in Proceedings of the 2016 IEEE International Conference on Industrial Technology (ICIT), pp. 592–598, Taipei, Taiwan, March 2016.

[53] F. Dambreville and A. Toumi, "Generic and massively concurrent computation of belief combination rules," in Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, pp. 1–6, Blagoevgrad, Bulgaria, November 2016.

[54] R. W. Techentin, M. W. Markland, R. J. Poole, D. R. Holmes, C. R. Haider, and B. K. Gilbert, "Page rank performance evaluation of cluster computing frameworks on Cray Urika-GX supercomputer," Computer Science and Engineering, vol. 6, pp. 33–38, 2016.

[55] O.-C. Marcu, A. Costan, G. Antoniu, and M. S. Perez-Hernandez, "Spark versus Flink: understanding performance in big data analytics frameworks," in Proceedings of the 2016 IEEE International Conference on Cluster Computing (CLUSTER 2016), pp. 433–442, Taipei, Taiwan, September 2016.



it surges towards a dimension in which automated resource management and its exploitation are of significant value [7].

An important factor for success in big data analytical projects is the management of resources: these platforms use a substantial amount of virtualized hardware resources to optimize the tradeoff between costs and results. Managing such resources is definitely a challenge. Complexity is rooted in their architecture. The first level of complexity stems from the performance requirements of computing nodes: typical big data applications utilize massively parallel computing resources, storage subsystems, and networking infrastructure, because results are required within a certain time frame or they can lose their value over time. Heterogeneity is a technological need: evolvability, extensibility, and maintainability of the hardware layer imply that the system will be partially integrated, replaced, or extended by means of new parts, according to availability on the market and the evolution of technology [7]. Another important consideration of modern applications is the massive amount of data that needs to be processed. Such data usually originate from different sets of devices (e.g., public web, business applications, satellites, or sensors) and procedures (e.g., case studies, observational studies, or simulations). Therefore, it is imperative to develop computational architectures with even better performance to support current and future application needs. Historically, this need for computational resources was met by high-performance computing (HPC) environments such as computer clusters, supercomputers, and grids. In traditional owner-centric HPC environments, internal resources are handled by a single administrative domain [19]. Cluster computing is the leading architecture for this environment. In distributed HPC environments such as grid computing, virtual organizations manage the provisioning of resources, both internal and external, to meet application needs [20]. However, the paradigm shift towards cloud computing has been widely discussed in more recent research [19, 21], targeting the execution of HPC workloads on cloud computing environments. Although organizations usually prefer to store their most sensitive data internally (on-premises), huge volumes of big data (owned by the enterprises or generated by third parties) may be stored externally; some of it may already be on a cloud. Retaining all data sources behind the firewall may result in a significant waste of resources. Analyzing the data where it resides, either internally or in a public cloud data center, makes more sense [1, 22].

Even if cloud computing is to be an enabler of the growth of big data applications, common cloud computing solutions are rather different from big data applications. Typically, cloud computing solutions offer fine-grained, loosely coupled applications, run to serve large numbers of users that operate independently from multiple locations, possibly on their own private, nonshared data, with a significant amount of interaction rather than being mainly batch-oriented, and they are generally fit to be relocated, with highly dynamic resource needs. Despite such differences, cloud computing and big data architectures share a number of common requirements, such as automated (or autonomic) fine-grained resource management and scaling-related issues [7].

As cloud computing begins to mature, a large number of enterprises are building efficient and agile cloud environments, and cloud providers continue to expand service offerings [1]. Microsoft's cloud Hadoop offering includes Azure Marketplace, which runs Cloudera Enterprise, MapR, and Hortonworks Data Platform (HDP) in a virtual machine, and Azure Data Lake, which includes Azure HDInsight, Data Lake Analytics, and Data Lake Store as managed services. The platform offers rich productivity suites for database, data warehouse, cloud spreadsheet, collaboration, business intelligence, OLAP, and development tools, delivering a growing Hadoop stack to the Microsoft community. Amazon Web Services reigns among the leaders of cloud computing and big data solutions; Amazon EMR is available across 14 regions worldwide. AWS offers versions of Hadoop, Spark, Tez, and Presto that can work off data stored in Amazon S3 and Amazon Glacier. Cloud Dataproc is Google's managed Hadoop and Spark cluster, which can use fully managed cloud services such as Google BigQuery and Bigtable. IBM differentiates BigInsights with end-to-end advanced analytics. IBM BigInsights runs on top of IBM's SoftLayer cloud infrastructure and can be deployed on more than 30 global data centers. IBM is making significant investments in Spark, BigQuality, BigIntegrate, and IBM InfoSphere BigMatch, which run natively with YARN to handle the toughest Hadoop use cases [23].

In this paper, we give an overview of some of the most popular and widely used big data frameworks in the context of the cloud computing environment, which are designed to cope with the above-mentioned resource management and scaling problems. The primary objective of the study is how to classify different big data resource management systems. We use various evaluation metrics for popular big data frameworks from different aspects. We also identify some key features which characterize big data frameworks, as well as their associated challenges and issues. We restricted our study selection criteria to empirical studies from the existing literature with reported evidence on performance evaluation of big data resource management frameworks. To the best of our knowledge, thus far there has been no empirically based performance evaluation report on the major resource management frameworks. We investigated the validity of existing research by performing a confirmatory study. For this purpose, standard performance evaluation tests as well as custom load test cases were performed on a 10+1-node t2.2xlarge Amazon AWS cluster. For experimentation and benchmarking, we followed the same process as outlined in our earlier study [24].

The study came up with some interesting findings which are in contradiction with the available literature on the Internet. The novelty of the study includes the categorization of cloud-based big data resource management frameworks according to their key features, a comparative evaluation of the popular big data frameworks, and the best practices related to the use of big data frameworks in the cloud.

The inclusion and exclusion criteria for relevant research studies are as follows:

(i) We selected only those resource management frameworks for which we found empirical evidence of being offered by various cloud providers.


Figure 1: Classification of big data resource management frameworks. The compute infrastructure is split into batch processing (Hadoop) and streaming, with streaming subdivided into micro-batch (Spark) and continuous streaming, the latter either checkpointing-based (Flink, Samza) or acknowledgement-based (Storm).

(ii) Several vendors offer proprietary solutions for big data analysis, which could be potential candidates for the comparative analysis conducted in this study. However, these frameworks were not selected, for two reasons. Firstly, most of these solutions are extensions of open-source solutions and hence exhibit nearly identical performance results in most cases. Secondly, for empirical studies, researchers mostly prefer open-source solutions, as the documentation, usage scenarios, source code, and other relevant details are freely available. Hence, we selected open-source solutions for the performance evaluation.

(iii) We did not include frameworks which are now deprecated or discontinued, such as Apache S4, in favor of other resource management systems.

This paper is organized as follows. Section 2 reviews the popular resource management frameworks. The comparison of big data frameworks is presented in Section 3. Based on the comparative evaluation, we categorize these systems in Section 4. Related work is presented in Section 5, and finally we present the conclusion and possible future directions in Section 6.

2. Big Data Resource Management Frameworks

Big data is offering new emerging trends and opportunities to unearth operational insight towards data management. The most challenging issue for organizations is often that the amount of data is massive, and it needs to be processed at an optimal speed to synthesize relevant results. Analyzing such

huge amounts of data from multiple sources can help organizations plan for the future and anticipate changing market trends and customer requirements. In many cases, big data is analyzed in batch mode; however, in many situations we may need to react to the current state of data or analyze data that is in motion (data that is constantly coming in and needs to be processed immediately). These applications require a continuous stream of often unstructured data to be processed. Therefore, data is continuously analyzed and cached in memory before it is stored on secondary storage devices. Processing streams of data works by filtering in-memory tables of data across a cluster of servers. Any delay in the data analysis can seriously impact customer satisfaction or may result in project failure [25].
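A single-process caricature of this stream-processing style, reacting to each event as it arrives while retaining only a bounded in-memory window instead of the full dataset, might look as follows (illustrative plain Python, not a distributed engine; the sensor readings are invented for the example):

```python
from collections import deque

def sliding_average(stream, window=3):
    # Keep an in-memory window of the most recent readings and emit
    # a running average immediately for each incoming event.
    buf = deque(maxlen=window)
    for reading in stream:
        buf.append(reading)
        yield sum(buf) / len(buf)

sensor = [10, 20, 30, 40]
print(list(sliding_average(sensor)))   # [10.0, 15.0, 20.0, 30.0]
```

A real stream engine distributes such windowed operators across a cluster and adds fault tolerance, but the core idea of processing data in motion rather than at rest is the same.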

While the Hadoop framework is a popular platform for processing huge datasets in parallel batch mode using commodity computational resources, there are a number of other computing infrastructures that can be used in various application domains. The primary focus of this study is to investigate popular big data resource management frameworks which are commonly used in the cloud computing environment. Most of the popular big data tools available for the cloud computing platform, including the Hadoop ecosystem, are available under open-source licenses. One of the key appeals of Hadoop and other open-source solutions is their low total cost of ownership. While proprietary solutions have expensive license fees and may require more costly specialized hardware, these open-source solutions have no licensing fees and can run on industry-standard hardware [14]. Figure 1 demonstrates the classification of various styles of processing architectures of open-source big data resource management frameworks.

In the subsequent section, we discuss various open-source big data resource management frameworks that are widely used in conjunction with the cloud computing environment.

2.1. Hadoop. Hadoop [26] is a distributed programming and storage infrastructure based on the open-source implementation of the MapReduce model [27]. MapReduce is the first and current de facto programming environment for developing data-centric parallel applications for parsing and processing large datasets. MapReduce is inspired by the Map and Reduce primitives used in functional programming. In MapReduce programming, users only have to write the logic of the Mapper and Reducer, while the process of shuffling, partitioning, and sorting is automatically handled by the execution engine [14, 27, 28]. The data can either be saved in the Hadoop file system as unstructured data or in a database as structured data [14]. The Hadoop Distributed File System (HDFS) is responsible for breaking large data files into smaller pieces known as blocks. The blocks are placed on different data nodes, and it is the job of the NameNode to know which blocks on which data nodes make up the complete file. The NameNode also works as a traffic cop, handling all access to the files, including reads, writes, creates, deletes, and replication of data blocks on the data nodes. A pipeline is a link between multiple data nodes that exists to handle the transfer of data across the servers. A user application pushes a block to the first data node in the pipeline. The data node takes over and forwards the block to the next node in the pipeline; this continues until all the data, and all the data replicas, are saved to disk. Afterwards, the client repeats the process by writing the next block in the file [25].
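The Mapper/Reducer split described above can be sketched in a few lines of plain Python. This is a toy single-machine simulation of the programming model, not the Hadoop Java API: the user supplies only the map and reduce logic, while the "engine" below handles shuffling, grouping, and sorting.

```python
from collections import defaultdict
from itertools import chain

def map_phase(record):
    # User-written Mapper: emit (word, 1) pairs for one input line.
    for word in record.split():
        yield (word.lower(), 1)

def reduce_phase(key, values):
    # User-written Reducer: sum the counts collected for a single key.
    return (key, sum(values))

def run_job(records):
    # "Engine": shuffle intermediate pairs, grouping values by key,
    # then apply the reducer to each group in sorted key order.
    groups = defaultdict(list)
    for key, value in chain.from_iterable(map_phase(r) for r in records):
        groups[key].append(value)
    return [reduce_phase(k, groups[k]) for k in sorted(groups)]

lines = ["big data in cloud computing", "big data frameworks"]
print(run_job(lines))
# [('big', 2), ('cloud', 1), ('computing', 1), ('data', 2),
#  ('frameworks', 1), ('in', 1)]
```

In real Hadoop the same division of labor holds, except that the map and reduce tasks run on many data nodes in parallel and the shuffle moves data across the network.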

The two major components of Hadoop MapReduce are job scheduling and tracking. The early versions of Hadoop supported a limited job and task tracking system. In particular, the earlier scheduler could not manage non-MapReduce tasks and was not capable of optimizing cluster utilization. A new capability was therefore aimed at addressing these shortcomings, offering more flexibility, scaling, efficiency, and performance. Because of these issues, Hadoop 2.0 was introduced. Alongside the earlier HDFS resource management and MapReduce model, it introduced a new resource management layer called Yet Another Resource Negotiator (YARN) that takes care of better resource utilization [25].

YARN is the core Hadoop service providing two major functionalities: global resource management (ResourceManager) and per-application management (ApplicationMaster). The ResourceManager is a master service which controls the NodeManager in each of the nodes of a Hadoop cluster. It includes a scheduler whose main task is to allocate system resources to specific running applications. All the required system information is tracked by a Resource Container, which monitors CPU, storage, network, and other important resource attributes necessary for executing applications in the cluster. The ResourceManager has a slave NodeManager service to monitor application usage statistics. Each deployed application is handled by a corresponding ApplicationMaster service. If more resources are required to support the running application, the ApplicationMaster requests the NodeManager, and the NodeManager negotiates with the ResourceManager (scheduler) for the additional capacity on behalf of the application [26].
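The negotiation loop just described can be caricatured in plain Python. All class and method names below are illustrative stand-ins, not the real YARN API; the point is only the control flow in which an ApplicationMaster asks for containers and the ResourceManager's scheduler grants them while cluster capacity lasts.

```python
class ResourceManager:
    def __init__(self, total_vcores):
        self.free_vcores = total_vcores

    def allocate(self, vcores):
        # Scheduler: grant the request only if capacity remains.
        if vcores <= self.free_vcores:
            self.free_vcores -= vcores
            return True
        return False

class ApplicationMaster:
    def __init__(self, rm):
        self.rm = rm
        self.containers = 0

    def request_containers(self, n, vcores_each=2):
        # Negotiate for additional capacity on behalf of the application.
        for _ in range(n):
            if self.rm.allocate(vcores_each):
                self.containers += 1
        return self.containers

rm = ResourceManager(total_vcores=8)
am = ApplicationMaster(rm)
print(am.request_containers(5))   # 4: only four 2-vcore containers fit
```

The real protocol is asynchronous and heartbeat-driven, and requests flow through the NodeManager as described above, but the capacity-bounded grant/deny decision is the essence of the scheduler's job.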

2.2. Spark. Apache Spark [29], originally developed as Berkeley Spark, was proposed as an alternative to Hadoop. It can perform faster parallel computing operations by using in-memory primitives. A job can load data into either local memory or a cluster-wide shared memory and query it iteratively, with much greater speed compared to disk-based systems such as Hadoop MapReduce [27]. Spark has been developed for two applications where keeping data in memory may significantly improve performance: iterative machine learning algorithms and interactive data mining. Spark is also intended to unify the current processing stack, where batch processing is performed using MapReduce, interactive queries are performed using HBase, and the processing of streams for real-time analytics is performed using other frameworks such as Twitter's Storm. Spark offers programmers a functional programming paradigm with data-centric programming interfaces built on top of a new data model called the Resilient Distributed Dataset (RDD), which is a collection of objects spread across a cluster, stored in memory or on disk [28]. Applications in Spark can load these RDDs into the memory of a cluster of nodes and let the Spark engine automatically manage the partitioning of the data and its locality during runtime. This versatile iterative model makes it possible to control the persistence and manage the partitioning of data. A stream of incoming data can be partitioned into a series of batches and processed as a sequence of small-batch jobs. The Spark framework allows this seamless combination of streaming and batch processing in a unified system. To provide rapid application development, Spark provides clean, concise APIs in Scala, Java, and Python. Spark can also be used interactively from the Scala and Python shells to rapidly query big datasets.
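The lazy, lineage-based execution style that the RDD model encourages can be mimicked with plain Python (a sketch only; PySpark is not used and `ToyRDD` is an invented name): transformations merely record a pipeline, and nothing executes until an action such as `collect()` is called.

```python
class ToyRDD:
    def __init__(self, data, pipeline=()):
        self._data = data
        self._pipeline = pipeline        # recorded transformations (lineage)

    def map(self, f):
        # Transformation: returns a new dataset, runs nothing yet.
        return ToyRDD(self._data, self._pipeline + (("map", f),))

    def filter(self, p):
        return ToyRDD(self._data, self._pipeline + (("filter", p),))

    def collect(self):
        # Action: replay the recorded lineage over the source data.
        items = iter(self._data)
        for kind, f in self._pipeline:
            items = map(f, items) if kind == "map" else filter(f, items)
        return list(items)

rdd = ToyRDD(range(10)).map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(rdd.collect())   # [0, 4, 16, 36, 64]
```

In Spark proper, the recorded lineage is also what enables fault tolerance: a lost partition can be recomputed by replaying its transformations, rather than by restoring a full replica.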

Spark is also the engine behind Shark, a complete Apache Hive-compatible data warehousing system that can run much faster than Hive. Spark also supports data access from Hadoop. Spark fits in seamlessly with the Hadoop 2.0 ecosystem (Figure 2) as an alternative to MapReduce, while using the same underlying infrastructure such as YARN and HDFS. Spark is also an integral part of the SMACK stack, which provides popular cloud-native PaaS capabilities such as IoT, predictive analytics, and real-time personalization for big data. In SMACK, the Apache Mesos cluster manager (instead of YARN) is used for dynamic allocation of cluster resources, not only for running Hadoop applications but also for handling heterogeneous workloads.

The GraphX and MLlib libraries include state-of-the-art graph and machine learning algorithms that can be executed in real time. BlinkDB is a novel parallel, sampling-based approximate query engine for running interactive SQL queries; it trades off query accuracy for response time, with results annotated by meaningful error bars. BlinkDB has been shown to run 200 times faster than Hive within an error rate of 2-10%. Moreover, Spark provides an interactive tool called Spark Shell, which allows exploiting the Spark cluster in real time. Once interactive applications are created, they may subsequently be executed interactively in the cluster.

Scientific Programming 5

Figure 2: Hadoop 2.0 ecosystem (source: [14]). The diagram stacks the Hardware, Hypervisor, OS, and JVM layers beneath HDFS (distributed file system) and YARN (cluster resource management), with MapReduce (distributed data processing) and TEZ (execution engine) above them, alongside HBase (column-oriented NoSQL database), Pig (data flow language), Hive (SQL queries), Lucene (search), Oozie (workflow scheduling), and Zookeeper (coordination).

Figure 3: Spark architecture (source: [15]). The driver program hosts the SparkContext together with the MLlib, SparkSQL, Spark Streaming, and GraphX libraries; a cluster manager (Spark Standalone, YARN, or Mesos) distributes work across the worker nodes, with data accessed from HDFS or a DBMS.

In Figure 3, we present the general Spark system architecture.

2.3. Flink. Apache Flink is an emerging competitor of Spark which offers functional programming interfaces quite similar to Spark's. It shares many programming primitives and transformations with Spark for iterative development, predictive analysis, and graph stream processing. Flink was developed to fill the gap left by Spark, which uses minibatch stream processing instead of a pure streaming approach. Flink ensures high processing performance when dealing with complex big data structures such as graphs. Flink programs are regular applications which are written with a rich set of transformation operations (such as mapping, filtering, grouping, aggregating, and joining) applied to the input datasets. The Flink dataset uses a table-based model; therefore, application developers can use index numbers to specify a particular field of a dataset [27, 28].

Flink is able to achieve high throughput and low latency, thereby processing a bundle of data very quickly. Flink is designed to run on large-scale clusters with thousands of nodes, and in addition to a standalone cluster mode, Flink provides support for YARN. For distributed environments, Flink chains operator subtasks together into tasks; each task is executed by one thread [16]. The Flink runtime consists of two types of processes. There is at least one JobManager (also called master), which coordinates the distributed execution: it schedules tasks, coordinates checkpoints, and coordinates recovery on failures. A high-availability setup may involve multiple JobManagers, one of which is always the leader while the others are standby. The TaskManagers (also called workers) execute the tasks (or, more specifically, the subtasks) of a dataflow and buffer and exchange the data streams. There must always be at least one TaskManager. The JobManagers and TaskManagers can be started in various ways: directly on the machines as a standalone cluster, in containers, or managed by resource frameworks like YARN or Mesos. TaskManagers connect to JobManagers, announcing themselves as available, and are assigned work. Figure 4 demonstrates the main components of the Flink framework.

2.4. Storm. Storm [17] is a free, open-source distributed stream processing computation framework. It takes several characteristics from the popular actor model and can be used with practically any programming language for developing applications such as real-time streaming analytics, critical workflow systems, and data delivery services. The engine may process billions of tuples each day in a fault-tolerant way. It can be integrated with popular resource management frameworks such as YARN, Mesos, and Docker. An Apache Storm cluster is made up of two types of processing actors: spouts and bolts.

(i) Spout is connected to the external data source of a stream and continuously emits or collects new data for further processing.

(ii) Bolt is a processing logic unit within a streaming processing topology; each bolt is responsible for a certain processing task such as transformation, filtering, aggregating, or partitioning.

Figure 4: Architectural components of Flink (source: [16]). A client program is translated by the optimizer/graph builder into a dataflow graph and submitted to the JobManager (master, or YARN Application Master), which hosts an actor system, the scheduler, and the checkpoint coordinator, and which deploys, stops, and cancels tasks. The TaskManagers (workers) run the tasks in task slots, each worker with its own memory and I/O manager, network manager, and actor system; they exchange data streams with one another and report task status, heartbeats, and statistics back to the JobManager.

Storm defines a workflow as a directed acyclic graph (DAG), called a topology, with connected spouts and bolts as vertices. Edges in the graph define the links between the bolts and the data stream. Unlike batch jobs, which are executed only once, Storm jobs run forever until they are killed. There are two types of nodes in a Storm cluster: nimbus (master node) and supervisor (worker node). Nimbus, similar to the Hadoop JobTracker, is the core component of Apache Storm and is responsible for distributing load across the cluster, queuing and assigning tasks to different processing units, and monitoring execution status. Each worker node executes a process known as the supervisor, which may have one or more worker processes. The supervisor delegates the tasks to worker processes; a worker process then creates a subset of the topology to run the task. Apache Storm relies on an internal distributed messaging system called Netty for communication between nodes, while Zookeeper manages the coordination between real-time job trackers (nimbus) and supervisors (Storm workers). Figure 5 outlines the high-level view of a Storm cluster.
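The spout-and-bolt topology can be mimicked with ordinary Python generators. This is a conceptual sketch of the dataflow only, not Storm's actual API; the sentences, function names, and the two-bolt chain are illustrative:

```python
def sentence_spout():
    """Spout: emits new data tuples; finite here for illustration."""
    for sentence in ["storm processes streams", "streams of tuples"]:
        yield sentence

def split_bolt(sentences):
    """Bolt: a transformation step that splits sentences into words."""
    for sentence in sentences:
        yield from sentence.split()

def count_bolt(words):
    """Bolt: an aggregation step that counts word occurrences."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wiring the topology as a DAG: spout -> split bolt -> count bolt.
counts = count_bolt(split_bolt(sentence_spout()))
print(counts)
```

In a real Storm topology, each spout and bolt would run as parallel task instances on worker processes, with tuples routed between them over the network.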

2.5. Apache Samza. Apache Samza [18] is a distributed stream processing framework mainly written in Scala and Java. Overall, it has a relatively high throughput as well as somewhat increased latency when compared to Storm [8]. It uses Apache Kafka, which was originally developed for LinkedIn, for messaging and streaming, while Apache Hadoop YARN or Mesos is utilized as an execution platform for overall resource management. Samza relies on Kafka's semantics to define the way streams are handled. Its main objective is to collect and deliver massively large volumes of event data, in particular log data, with a low latency. A Kafka system's architecture is comparatively simple, as it only consists of a set of brokers, which are individual nodes that make up a Kafka cluster. Data streams are defined by topics; a topic is a stream of related information that consumers can subscribe to. Topics are divided into partitions that are distributed over the broker instances, and the corresponding messages are retrieved using a pull mechanism. The basic flow of job execution is presented in Figure 6.
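The topic/partition model described above can be sketched in plain Python. This is a conceptual illustration, not Kafka's API: the helper names are invented, and a CRC32 hash stands in for Kafka's default (murmur2-based) partitioner:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    """Assign a message key to a partition via a stable hash."""
    return zlib.crc32(key.encode()) % num_partitions

def append(log, topic, partition, message):
    """Brokers append messages to per-partition logs for a topic."""
    log.setdefault((topic, partition), []).append(message)

log = {}
for user, event in [("alice", "click"), ("bob", "view"), ("alice", "buy")]:
    append(log, "events", partition_for(user, 2), (user, event))

# All of a key's messages land in the same partition, preserving their
# order, which is the per-partition ordering guarantee Samza relies on.
alice_partition = partition_for("alice", 2)
print([m for m in log[("events", alice_partition)] if m[0] == "alice"])
```

A Samza task would then pull messages from its assigned partitions, consuming each partition's log in order.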

Tables 1 and 2 present a brief comparative analysis of these frameworks based on some common attributes. As shown in the tables, the MapReduce computation data flow follows a chain of stages with no loop; at each stage, the program proceeds with the output from the previous stage and generates an input for the next stage. Although machine learning algorithms are mostly designed in the form of cyclic data flows, Spark, Storm, and Samza represent them as directed acyclic graphs to optimize the execution plan, whereas Flink supports a controlled cyclic dependency graph at runtime to represent machine learning algorithms in a very efficient way.

Table 1: Comparison of big data frameworks.

Attribute | Hadoop | Spark | Storm | Samza | Flink
Current stable version | 2.8.1 | 2.2.0 | 1.1.1 | 0.13.0 | 1.3.2
Batch processing | Yes | Yes | Yes | No | Yes
Computational model | MapReduce | Streaming (microbatches) | Streaming (microbatches) | Streaming | Supports continuous flow streaming, microbatch, and batch
Data flow | Chain of stages | Directed acyclic graph | Directed acyclic graphs (DAGs) with spouts and bolts | Streams (acyclic graph) | Controlled cyclic dependency graph through machine learning
Resource management | YARN | YARN, Mesos | HDFS (YARN), Mesos | YARN, Mesos | Zookeeper, YARN, Mesos
Language support | All major languages | Java, Scala, Python, and R | Any programming language | JVM languages | Java, Scala, Python, and R
Job management optimization | MapReduce approach | Catalyst extension | Storm-YARN, 3rd-party tools like Ganglia | Internal JobRunner | Internal optimizer
Interactive mode | None (3rd-party tools like Impala can be integrated) | Interactive shell | None | Limited API of Kafka streams | Scala shell
Machine learning libraries | Apache Mahout, H2O | SparkML and MLlib | Trident-ML, Apache SAMOA | Apache SAMOA | Flink-ML
Maximum reported nodes (scalability) | Yahoo Hadoop cluster with 42,000 nodes | 8,000 | 300 | LinkedIn with around a hundred node clusters | Alibaba customized Flink cluster with thousands of nodes

Table 2: Comparative analysis of big data resource frameworks (star ratings out of s = 5; N = not available).

Metric | Hadoop | Spark | Flink | Storm | Samza
Processing speed | ★★★ | ★★★★ | ★★★★★ | ★★★★ | ★★★★
Fault tolerance | ★★★★ | ★★ | ★★★★ | ★★★ | ★★★★
Scalability | ★★★★★ | ★★★★ | ★★★ | ★★★ | ★★★
Machine learning | ★★ | ★★★★★ | ★★★★ | ★★★ | ★★★★
Low latency | ★★ | ★★★ | ★★★ | ★★★★ | ★★★★
Security | ★★★★ | ★★★★★ | ★★★★ | ★★★★ | N
Dataset size | ★★★★★ | ★★★ | ★★★★ | ★★★ | ★★★

Figure 5: Architecture of a Storm cluster (source: [17]). The nimbus master node coordinates, through the Zookeeper framework, a set of worker nodes; each worker node runs a supervisor with one or more worker processes, and each worker process hosts executors running tasks.

Figure 6: Samza architecture (source: [18]). A Samza job consumes Kafka topics (e.g., Topics 1 and 2) and produces to another (Topic 3); on each machine, a YARN NodeManager hosts Samza containers in which the individual tasks run.

Hadoop and Storm do not provide any default interactive environment. Apache Spark has a command-line interactive shell to use the application features, and Flink provides a Scala shell for standalone as well as cluster setups. Apache Hadoop is highly scalable, and it has been used in Yahoo production clusters consisting of 42,000 nodes across 20 YARN clusters. The largest known cluster size for Spark is 8,000 computing nodes, while Storm has been tested on a maximum of 300-node clusters. An Apache Samza cluster with around a hundred nodes has been used in LinkedIn's data flow and application messaging system. Apache Flink has been customized for the Alibaba search engine with a deployment capacity of thousands of processing nodes.

3. Comparative Evaluation of Big Data Frameworks

Big data in cloud computing, a popular research trend, is exerting significant influence on current enterprises, IT industries, and research communities. A number of disruptive and transformative big data technologies and solutions are rapidly emanating and evolving in order to provide data-driven insight and innovation. Furthermore, modern cloud computing services offer all kinds of big data analytic tools, technologies, and computing infrastructures to speed up the data analysis process at an affordable cost. Although many distributed resource management frameworks are available nowadays, the main issue is how to select a suitable big data framework. The selection of one big data platform over the others will come down to the specific application requirements and constraints, which may involve several tradeoffs and application usage scenarios. However, we can identify some key factors that need to be fulfilled before deploying a big data application in the cloud. In this section, based on empirical evidence from the available literature, we discuss the advantages and disadvantages of each resource management framework.

3.1. Processing Speed. Processing speed is an important performance measurement that may be used to evaluate the effectiveness of different resource management frameworks. It is a common metric for the maximum number of I/O operations to disk or memory, or the data transfer rate between the computational units of the cluster, over a specific amount of time. In the context of big data, the average processing speed, represented as $m$ and calculated after $n$ iteration runs, is the maximum number of memory/disk intensive operations that can be performed over the time intervals $t_i$:

$$m = \frac{\sum_{i=1}^{n} m_i}{\sum_{i=1}^{n} t_i}. \quad (1)$$
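Equation (1) transcribed into a small Python helper, with illustrative operation counts and time intervals:

```python
def average_processing_speed(ops, times):
    """m = sum(m_i) / sum(t_i): operations performed per unit of time."""
    return sum(ops) / sum(times)

# Three iteration runs: per-run operation counts and their time intervals.
print(average_processing_speed([1200, 800, 1000], [2.0, 1.0, 2.0]))  # 600.0
```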

Veiga et al. [30] conducted a series of experiments on a multicore cluster setup to compare the performance of Apache Hadoop, Spark, and Flink. Apache Spark and Flink proved to be much more efficient execution platforms than Hadoop when performing nonsort benchmarks. It was further noted that Spark showed better performance for operations such as WordCount and K-Means (CPU-bound in nature), while Flink achieved better results for the PageRank algorithm (memory-bound in nature). Mavridis and Karatza [31] experimentally compared performance statistics of Apache Hadoop and Spark on the Okeanos IaaS cloud platform. For each set of experiments, the relevant statistics on execution time, working nodes, and dataset size were recorded. Spark performance was found to be optimal as compared to Hadoop for most of the cases; however, Spark on the YARN platform showed suboptimal results as compared to when it was executed in standalone mode. Similar results were also observed by Zaharia et al. [32] on a 100 GB dataset record. Vellaipandiyan and Raja [33] demonstrated performance evaluation and comparison of the Hadoop and Spark frameworks on a residents' record dataset ranging from 100 GB to 900 GB in size. Spark's performance was relatively better when the dataset size was small to medium (100 GB-750 GB); afterwards, its performance declined as compared to Hadoop. The primary reason for the performance decline was that the Spark cache could not fit the larger datasets into memory. Taran et al. [34] quantified performance differences of Hadoop and Spark using a WordCount dataset ranging from 100 KB to 1 GB. It was observed that the Hadoop framework was five times faster than Spark when the evaluation was performed using larger data sources, whereas for smaller tasks Spark showed better performance. However, the speed-up ratio decreased for both frameworks with the growth of the input dataset.

Gopalani and Arora [35] used the K-Means algorithm on several small, medium, and large location datasets to compare the Hadoop and Spark frameworks. The study results showed that Spark performed up to three times better than MapReduce for most of the cases. Bertoni et al. [36] performed an experimental evaluation of Apache Flink and Storm using a large genomic dataset on the Amazon EC2 cloud. Apache Flink was superior to Storm for histogram and map operations, while Storm outperformed Flink when the genomic join application was deployed.

3.2. Fault Tolerance. Fault tolerance is the characteristic that enables a system to continue functioning in case of the failure of one or more components. High-performance computing applications involve hundreds of nodes that are interconnected to perform a specific task; failing a node should have zero or minimal effect on the overall computation. The tolerance of a system, represented as $\mathrm{Tol_{FT}}$, to meet its requirements after a disruption is the ratio of the time to complete tasks without observing any fault events to the overall execution time in which some fault events were detected and the system state was reverted back to a consistent state:

$$\mathrm{Tol_{FT}} = \frac{T_x}{T_x + \sigma^2}, \quad (2)$$

where $T_x$ is the estimated correct execution time, obtained from a program run that is presumed to be fault-free or by averaging the execution time from several application runs that produce a known correct output, and $\sigma^2$ represents the variance in a program's execution time due to the occurrence of fault events. For an HPC application that consists of a set of computationally intensive tasks $\Gamma = \{\tau_1, \tau_2, \ldots, \tau_n\}$, with the tolerance of each individual task computed as $\mathrm{Tol}_{\tau_i}$, the overall application resilience $\mathrm{Tol}$ may be calculated as [37]

$$\mathrm{Tol} = \sqrt[n]{\mathrm{Tol}_{\tau_1} \cdot \mathrm{Tol}_{\tau_2} \cdots \mathrm{Tol}_{\tau_n}}. \quad (3)$$
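Equations (2) and (3) computed directly, with illustrative numbers (the geometric mean in (3) uses `math.prod`, available from Python 3.8):

```python
import math

def tol_ft(t_x, sigma_sq):
    """Per-task tolerance: Tol_FT = T_x / (T_x + sigma^2), eq. (2)."""
    return t_x / (t_x + sigma_sq)

def overall_resilience(tols):
    """Overall Tol: the n-th root of the product of per-task tolerances, eq. (3)."""
    return math.prod(tols) ** (1.0 / len(tols))

# Two tasks: one fault-free (variance 0), one disturbed by fault events.
tols = [tol_ft(100.0, 0.0), tol_ft(90.0, 10.0)]  # 1.0 and 0.9
print(round(overall_resilience(tols), 4))  # sqrt(1.0 * 0.9) ~= 0.9487
```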

Lu et al. [38] used the StreamBench toolkit to evaluate the performance and fault tolerance ability of Apache Spark, Storm, and Samza. It was found that, with increased data size, Spark is much more stable and fault-tolerant as compared to Storm, but may be less efficient when compared to Samza. Furthermore, when compared in terms of handling a large capacity of data, both Samza and Spark outperformed Storm. Gu and Li [39] used the PageRank algorithm to perform a comparative experiment on the Hadoop and Spark frameworks. It was observed that, for smaller datasets such as wiki-Vote and soc-Slashdot0902, Spark outperformed Hadoop by a considerable margin. However, this speed-up degraded with the growth of the dataset, and for large datasets Hadoop easily outperformed Spark. Furthermore, for massively large datasets, Spark was reported to crash with a JVM heap exception while Hadoop still completed its task. Lopez et al. [40] evaluated the throughput and fault tolerance mechanisms of Apache Storm and Flink. The experiments were based on a threat detection system, where Apache Storm demonstrated better throughput as compared to Flink. For fault tolerance, different virtual machines were manually turned off to analyze the impact of node failures. Apache Flink used its internal subsystem to detect and migrate the failed tasks to other machines and hence resulted in very few message losses. On the other hand, Storm took more time, as Zookeeper, involving some performance overhead, was responsible for reporting the state of nimbus and thereafter processing the failed task on other nodes.

3.3. Scalability. Scalability refers to the ability to accommodate large loads or changes in size/workload by provisioning resources at runtime. This can be further categorized as scale-up (making the hardware stronger) or scale-out (adding additional nodes). One of the critical requirements of enterprises is to process large volumes of data in a timely manner to address high-value business problems. Dynamic resource scalability allows business entities to perform massive computation in parallel, thus reducing overall time complexity and effort. The definition of scalability comes from Amdahl's and Gustafson's laws [41]. Let $W$ be the size of the workload before the improvement of the system resources, let $\alpha$ be the fraction of the execution workload that benefits from the improvement of system resources, and let $1 - \alpha$ be the fraction concerning the part that does not benefit from it. When using an $n$-processor system, the user workload is scaled to

$$W' = \alpha W + (1 - \alpha) n W. \quad (4)$$

The parallel execution time of a scaled workload on $n$ processors defines the scaled-workload speed-up $S$, as shown in

$$S = \frac{W'}{W} = \frac{\alpha W + (1 - \alpha) n W}{W} = \alpha + (1 - \alpha) n. \quad (5)$$
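A direct transcription of equations (4) and (5); the workload size, fraction, and processor count are illustrative values:

```python
def scaled_workload(w, alpha, n):
    """W' = alpha*W + (1 - alpha)*n*W, eq. (4)."""
    return alpha * w + (1 - alpha) * n * w

def scaled_speedup(alpha, n):
    """S = W'/W = alpha + (1 - alpha)*n, eq. (5)."""
    return alpha + (1 - alpha) * n

# Half of the workload scales with the processors on an 8-processor system:
print(scaled_workload(100.0, 0.5, 8))  # 450.0
print(scaled_speedup(0.5, 8))          # 4.5
```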

García-Gil et al. [42] performed a scalability and performance comparison of Apache Spark and Flink using a feature selection framework that assembles multiple information-theoretic criteria into a single greedy algorithm. The ECBDL14 dataset was used to measure the scalability factor for the frameworks, and it was observed that Spark's scalability performance was 4-10 times faster than Flink's. Jakovits and Srirama [43] analyzed four MapReduce-based frameworks, including Hadoop and Spark, benchmarking the Partitioning Around Medoids, Clustering Large Applications, and Conjugate Gradient linear system solver algorithms using MPI. All experiments were performed on a cloud of 33 Amazon EC2 large instances. For all algorithms, Spark performed much better than Hadoop in terms of both performance and scalability. Boden et al. [44] investigated scalability with respect to both data size and dimensionality in order to provide a fair and insightful benchmark that reflects the requirements of real-world machine learning applications. The benchmark comprised distributed optimization algorithms for supervised learning as well as algorithms for unsupervised learning. For supervised learning, they implemented machine learning algorithms using the Breeze library, while for unsupervised learning they chose the popular k-Means clustering algorithm in order to assess the scalability of the two resource management frameworks. The overall execution time of Flink was relatively low in resource-constrained settings with a limited number of nodes, while Spark had a clear edge once enough main memory was available due to the addition of new computing nodes.

3.4. Machine Learning and Iterative Tasks Support. Big data applications are inherently complex in nature and usually involve tasks and algorithms that are iterative. These applications have a distinct cyclic nature, repeatedly executing a set of tasks until the result can no longer be substantially improved.

Spangenberg et al. [45] used real-world datasets with four algorithms, namely, WordCount, K-Means, PageRank, and relational query, to benchmark Apache Flink and Storm. It was observed that Apache Storm performs better in batch mode as compared to Flink; however, with increasing complexity, Apache Flink had a performance advantage over Storm, and thus it was better suited for iterative data or graph processing. Shi et al. [46] focused on analyzing Apache MapReduce and Spark for batch and iterative jobs. For smaller datasets, Spark turned out to be the better choice, but when experiments were performed on larger datasets, MapReduce was several times faster than Spark. For iterative operations such as K-Means, Spark was 1.5 times faster than MapReduce in the first iteration, and more than 5 times faster in subsequent iterations.

Kang and Lee [47] examined five resource management frameworks, including Apache Hadoop and Spark, with respect to the performance overheads (disk input/output, network communication, scheduling, etc.) incurred in supporting iterative computation. The PageRank algorithm was used to evaluate these performance issues. Since static data processing tends to be a more frequent operation than dynamic data processing, being repeated in every iteration of MapReduce, it may cause significant performance overhead in the case of MapReduce. On the other hand, Apache Spark uses read-only cached versions of objects (resilient distributed datasets) which can be reused in parallel operations, thus reducing the performance overhead during iterative computation. Lee et al. [48] evaluated five systems, including Hadoop and Spark, over various workloads to compare four iterative algorithms. The experimentation was performed on the Amazon EC2 cloud. Overall, Spark showed the best performance when iterative operations were performed in main memory; in contrast, the performance of Hadoop was significantly poorer than that of the other resource management systems.

3.5. Latency. Big data and low latency are strongly linked. Big data applications provide true value to businesses, but they are mostly time critical. If cloud computing is to be a successful platform for big data implementation, one of the key requirements will be the provisioning of high-speed networks to reduce communication latency. Furthermore, big data frameworks usually involve a centralized design where the scheduler assigns all tasks through a single node, which may significantly impact the latency when the size of the data is huge.


Let $T_{\mathrm{elapsed}}$ be the elapsed time between the start and finish times of a program in a distributed architecture, let $T_i$ be the effective execution time, and let $\lambda_i$ be the sum of the total idle units of the $i$th processor from a set of $N$ processors. Then the average latency $\lambda(W, N)$ for a workload of size $W$ is defined as the average amount of overhead time needed for each processor to complete the task:

$$\lambda(W, N) = \frac{\sum_{i=1}^{N} \left(T_{\mathrm{elapsed}} - T_i + \lambda_i\right)}{N}. \quad (6)$$
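Equation (6) as a small Python helper, with illustrative timings for a two-processor run:

```python
def average_latency(t_elapsed, exec_times, idle_units):
    """lambda(W, N): mean per-processor overhead time, eq. (6)."""
    n = len(exec_times)
    return sum(t_elapsed - t + idle
               for t, idle in zip(exec_times, idle_units)) / n

# Job wall-clock time of 10 units; per-processor execution and idle times.
print(average_latency(10.0, [8.0, 9.0], [1.0, 0.5]))  # ((10-8+1)+(10-9+0.5))/2 = 2.25
```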

Chintapalli et al. [49] conducted a detailed analysis of the Apache Storm, Flink, and Spark streaming engines for latency and throughput. The study results indicated that at high throughput, Flink and Storm have significantly lower latency as compared to Spark; however, Spark was able to handle higher throughput than the other streaming engines. Lu et al. [38] proposed the StreamBench benchmark framework to evaluate modern distributed stream processing frameworks. The framework includes dataset selection, data generation methodologies, a program set description, workload suite design, and metric propositions. Two real-world datasets, AOL Search Data and the CAIDA Anonymized Internet Traces Dataset, were used to assess the performance, scalability, and fault tolerance aspects of the streaming frameworks. It was observed that Storm's latency in most cases was far less than Spark's, except when the scales of the workload and dataset were massive in nature.

3.6. Security. Unlike the classical HPC environment, where information is stored in-house, many big data applications are now increasingly deployed on the cloud, where privacy-sensitive information may be accessed or recorded by different data users with ease. Although data privacy and security issues are not a new topic in the area of distributed computing, their importance is amplified by the wide adoption of cloud computing services for big data platforms. A dataset may be exposed to multiple users for different purposes, which may lead to security and privacy risks.

Let $N$ be the number of security categories that may be covered by a security mechanism; for instance, a framework may use an encryption mechanism to provide data security and access control lists for authentication and authorization services. Let $\max(W_i)$ be the maximum weight assigned to the $i$th security category from the list of $N$ categories, and let $W_i$ be the reputation score of a particular resource management framework. Then the framework security ranking score can be represented as

$$\mathrm{Sec_{score}} = \frac{\sum_{i=1}^{N} W_i}{\sum_{i=1}^{N} \max(W_i)}. \quad (7)$$
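Equation (7) as code, with two hypothetical security categories (e.g., encryption and access control), each with a maximum weight of 1.0:

```python
def security_score(scores, max_weights):
    """Sec_score = sum(W_i) / sum(max(W_i)) over N security categories, eq. (7)."""
    return sum(scores) / sum(max_weights)

# Hypothetical reputation scores of 0.8 and 0.5 against maxima of 1.0 each.
print(security_score([0.8, 0.5], [1.0, 1.0]))  # 0.65
```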

Hadoop and Storm use the Kerberos authentication protocol for computing nodes to prove their identity [50]. Spark adopts a password-based shared secret configuration as well as Access Control Lists (ACLs) to control the authentication and authorization mechanisms. In Flink, stream brokers are responsible for providing the authentication mechanism across multiple services. Apache Samza/Kafka provides no built-in security at the system level.

3.7. Dataset Size Support. Many scientific applications scale up to hundreds of nodes to process massive amounts of data that may exceed hundreds of terabytes. Unless big data applications are properly optimized for larger datasets, this may result in performance degradation with the growth of the data, and in many cases may even cause a software crash, resulting in loss of time and money. We used the same methodology to collect big data support statistics as presented in the earlier sections.

4. Discussion on Big Data Frameworks

Every big data framework has evolved around its unique design characteristics and application-specific requirements. Based on the key factors and the empirical evidence used to evaluate big data frameworks, as discussed in Section 3, we summarize the results of the seven evaluation factors in the form of star rankings, $\mathrm{Ranking_{RF}} \in [0, s]$, where $s$ represents the maximum evaluation score. The general equation to produce the ranking for each resource framework (RF) is given as

$$\mathrm{Ranking_{RF}} = \left\| \frac{s}{\sum_{i=1}^{7} \sum_{j=1}^{N} \max(W_{ij})} \ast \sum_{i=1}^{7} \sum_{j=1}^{N} W_{ij} \right\|, \quad (8)$$
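Assuming the double bars in equation (8) denote rounding to the nearest integer star value, the ranking can be sketched as follows; the study weights and their maxima are hypothetical:

```python
def framework_ranking(s, weights, max_weights):
    """Ranking_RF = round(s * sum(W_ij) / sum(max(W_ij))), in [0, s], eq. (8)."""
    return round(s * sum(weights) / sum(max_weights))

# s = 5 stars; three hypothetical study scores against maxima of 1.0 each:
print(framework_ranking(5, [0.9, 0.7, 0.8], [1.0, 1.0, 1.0]))  # 4
```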

where $N$ is the total number of research studies for a particular set of evaluation metrics, $\max(W_{ij}) \in [0, 1]$ is the relative normalized weight assigned to each literature study based on the number of experiments performed, and $W_{ij}$ is the framework test-bed score calculated from the experimentation results of each study.

Hadoop MapReduce has a clear edge in large-scale deployment and larger dataset processing. Hadoop is highly compatible and interoperable with other frameworks. It also offers a reliable fault tolerance mechanism that can provide failure-free operation over a long period of time, and it can operate on low-cost hardware configurations. However, Hadoop is not suitable for real-time applications: it has a significant disadvantage when latency, throughput, and iterative job support for machine learning are the key considerations of the application requirements.

Apache Spark is designed to be a replacement for the batch-oriented Hadoop ecosystem, running over static and real-time datasets. It is highly suitable for high-throughput streaming applications where latency is not a major issue. Spark is memory intensive, and all operations take place in memory; as a result, it may crash if enough memory is not available for further operations (before the release of Spark version 1.5, it was not capable of handling datasets larger than the size of RAM, and the problem of handling larger datasets persists in the newer releases with different performance overheads). A few research efforts, such as Project Tungsten, are aimed at improving the memory and CPU efficiency of Spark applications. Spark also lacks its own storage system, so its integration with HDFS through YARN, or with Cassandra using Mesos, is an extra overhead for cluster configuration.

Apache Flink is a true streaming engine. Flink supports both batch and real-time operations over a common runtime to fulfill the requirements of the Lambda architecture; it may also work in batch mode by stopping the streaming source. Like Spark, Flink performs all operations in memory, but in case of a memory hog it may also use disk storage to avoid application failure. Flink has some major advantages over Hadoop and Spark, providing better support for iterative processing with high throughput and low latency.

Apache Storm was designed to provide a scalable, fault-tolerant, real-time streaming engine for data analysis, doing for real-time processing what Hadoop did for batch processing. However, the empirical evidence suggests that Apache Storm is inefficient at meeting the scale-up/scale-down requirements of real-time big data applications. Furthermore, since it uses microbatch stream processing, it is not very efficient where continuous stream processing is a major concern, nor does it provide a mechanism for simple batch processing. For fault tolerance, Storm uses Zookeeper to store the state of the processes, which may involve some extra overhead and may also result in message loss. On the other hand, Storm is an ideal solution for near-real-time application processing, where the workload can be processed with minimal delay under strict latency requirements.

Apache Samza, in integration with Kafka, provides some unique features that are not offered by other stream processing engines. Samza provides a powerful checkpointing-based fault tolerance mechanism with minimal data loss. Samza jobs can have high throughput with low latency when integrated with Kafka. However, Samza lacks some important features as a data processing engine. Furthermore, it offers no built-in security mechanism for data access control.
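The checkpointing-based fault tolerance idea, consuming a Kafka-like log while periodically recording the consumed offset, can be sketched in plain Python; the checkpoint interval and crash point below are illustrative assumptions, not Samza's actual mechanics:

```python
def process_with_checkpoints(log, checkpoint_every=3, crash_at=None):
    """Consume an ordered log, durably checkpointing the consumed offset
    every `checkpoint_every` messages. Returns (processed, last_checkpoint).
    On a crash, work done since the last checkpoint is reprocessed on
    restart, giving at-least-once delivery with minimal data loss."""
    processed, checkpoint = [], 0
    for offset in range(len(log)):
        if crash_at is not None and offset == crash_at:
            return processed, checkpoint  # crash before this offset
        processed.append(log[offset])
        if (offset + 1) % checkpoint_every == 0:
            checkpoint = offset + 1  # record consumed position
    return processed, checkpoint

log = ["m0", "m1", "m2", "m3", "m4", "m5", "m6"]
done, ckpt = process_with_checkpoints(log, checkpoint_every=3, crash_at=5)
# crash at offset 5: m0..m4 were processed, last checkpoint is offset 3
resumed, _ = process_with_checkpoints(log[ckpt:], checkpoint_every=3)
# m3 and m4 are processed a second time -- minimal, but nonzero, duplication
```

The recovery replays only the messages after the last checkpoint, which is why data loss is minimal but a small amount of reprocessing is possible.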

To categorize the selection of the best resource engine based on a particular set of requirements, we use the framework proposed by Chung et al. [51]. The framework provides a layout for matching, ranking, and selecting a system based on particular requirements. The matching criterion is based on user goals, which are categorized as soft and hard goals. Soft goals represent the nonfunctional requirements of the system (such as security, fault tolerance, and scalability), while hard goals represent the functional aspects of the system (such as machine learning and data size support). The relationship between the system components and the goals can be ranked as very positive (++), positive (+), negative (−), and very negative (−−). Based on the evidence provided in the literature, the categorization of the major resource management frameworks is presented in Figure 7.
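A goal-weighted selection of this kind can be sketched in a few lines of Python. The numeric encoding of the ++/+/−/−− scale, the example rankings, and the weights are all hypothetical illustrations, not the actual values of Figure 7 or of Chung et al.'s framework:

```python
# Hypothetical numeric encoding of the ++ / + / - / -- rankings.
SCALE = {"++": 2, "+": 1, "-": -1, "--": -2}

# Illustrative (not actual) goal rankings for three engines.
rankings = {
    "Spark": {"fault tolerance": "++", "latency": "+",  "machine learning": "++"},
    "Flink": {"fault tolerance": "+",  "latency": "++", "machine learning": "+"},
    "Storm": {"fault tolerance": "-",  "latency": "++", "machine learning": "-"},
}

def select_engine(requirements, rankings):
    """Weight each goal by its importance to the application and pick the
    engine with the highest total score."""
    def score(engine):
        return sum(weight * SCALE[rankings[engine][goal]]
                   for goal, weight in requirements.items())
    return max(rankings, key=score)

# A latency-critical streaming application with no ML component:
print(select_engine({"latency": 3, "fault tolerance": 1}, rankings))  # Flink
```

Changing the weights changes the winner: an ML-heavy, fault-tolerance-sensitive workload would select Spark under the same illustrative rankings.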

5. Related Work

Our research work differs from other efforts because the subject, goal, and object of study are not identical, as we provide an in-depth comparison of popular resource engines based on empirical evidence from the existing literature. Hesse and Lorenz [8] conducted a conceptual survey on stream processing systems. However, their discussion was focused on some basic differences related to real-time data processing engines. Singh and Reddy [9] provided a thorough analysis of big data analytic platforms that included peer-to-peer networks, field programmable gate arrays (FPGA), the Apache Hadoop ecosystem, high-performance computing (HPC) clusters, multicore CPUs, and graphics processing units (GPU). Our case is different here, as we are particularly interested in big data processing engines. Finally, Landset et al. [10] focused on machine learning libraries and their evaluation based on ease of use, scalability, and extensibility. However, our work differs from their work, as the primary focuses of the two studies are not identical.

Chen and Zhang [11] discussed big data problems, challenges, and the associated techniques and technologies to address these issues. Several potential techniques, including cloud computing, quantum computing, granular computing, and biological computing, were investigated, and the possible opportunities to explore these domains were demonstrated. However, the performance evaluation was discussed only on theoretical grounds. A taxonomy and detailed analysis of the state of the art in big data 2.0 processing systems were presented in [12]. The focus of the study was to identify current research challenges and highlight opportunities for new innovations and optimization for future research and development. Assuncao et al. [13] reviewed multiple generations of data stream processing frameworks that provide mechanisms for resource elasticity to match the demands of stream processing services. The study examined the challenges associated with efficient resource management decisions and suggested solutions derived from the existing research studies. However, the study metrics are restricted to the elasticity/scalability aspect of big data streaming frameworks.

As shown in Table 3, our work differs from the previous studies, which focused on the classification of resource management frameworks on theoretical grounds. In contrast to the earlier approaches, we classify and categorize big data resource management frameworks on empirical grounds, derived from multiple evaluation/experimentation studies. Furthermore, our evaluation/ranking methodology is based on a comprehensive list of study variables which were not addressed in the studies conducted earlier.

6. Conclusions and Future Work

There are a number of disruptive and transformative big data technologies and solutions that are rapidly emerging and evolving in order to provide data-driven insight and innovation. The primary objective of the study was to classify popular big data resource management systems. This study was also aimed at addressing the selection of a candidate resource provider based on specific big data application requirements. We surveyed different big data resource management frameworks and investigated the advantages and disadvantages of each of them. We carried out the performance evaluation of the resource management engines based on seven key factors, and each of the frameworks was ranked based on the empirical evidence from the literature.

6.1. Observations and Findings. Some key findings of the study are as follows:

(i) In terms of processing speed, Apache Flink outperforms the other resource management frameworks for small, medium, and large datasets [30, 36]. However, during our own set of experiments on an Amazon EC2


Table 3: Comparison and application areas of related research studies.

Study reference | Data model | Resource frameworks | Study features | Evaluation/ranking methodology
[8] | Data stream processing systems | Storm, Flink, Spark, Samza | A brief comparison of resource frameworks | N/A
[9] | Batch and stream processing systems | Horizontal scaling systems such as peer-to-peer, MapReduce, MPI, and Spark, and vertical scaling systems such as CUDA and HDL | Comparison of horizontal and vertical scaling systems | Theoretical comparison of resource frameworks
[10] | Batch and stream processing engines | MapReduce, Spark, Flink, and Storm, as well as machine learning libraries | Machine learning libraries and their evaluation mechanism | Performance comparison with respect to machine learning toolkits
[11] | Batch and stream processing frameworks | Hadoop, Storm, and other big data frameworks | In-depth analysis of big data opportunities and challenges | N/A
[12] | Batch and stream processing frameworks | Hadoop, Spark, Storm, Flink, and Tez, as well as SQL, Graph, and the bulk synchronous parallel model | Analysis of current open research challenges in the field of big data and the promising directions for future research | N/A
[13] | Stream processing engines | Apache Storm, S4, Flink, Samza, Spark Streaming, and Twitter Heron | Classification of elasticity metrics for resource allocation strategies that meet the demands of stream processing services | Evaluation of elasticity scaling metrics for stream processing systems


Figure 7: Categorization of resource engines based on key big data application requirements. The figure rates Hadoop, Spark, Flink, Storm, and Samza against soft (nonfunctional) goals, including fault tolerance (recoverability), scalability, security (access control and authentication), and latency, and hard (functional) goals, including machine learning (iterative processing), data size, and performance, using the ++/+/−/−− scale.

cluster with varied task manager settings (1–4 task managers per node), Flink failed to complete custom smaller-size JVM dataset jobs due to inefficient memory management by the Flink memory manager. We could not find any reported evidence of this particular use case in the relevant literature. It seems that most of the performance evaluation studies employed standard benchmarking test sets, where the dataset size was relatively large, and hence this particular use case was not reported in these studies. Further research effort is required to elucidate the specific underlying factors in this particular case.

(ii) Big data applications usually involve massive amounts of data. Apache Spark supports the necessary strategies for its fault tolerance mechanism, but it has been reported to crash on larger datasets. Even during our experimentations, Apache Spark version 1.6 (selected for compatibility with earlier research) crashed on several occasions when the dataset was larger than 500 GB. Although Spark has been ranked higher in terms of fault tolerance with the increase of data scale in several studies [38, 52], it has limitations in handling larger, typical big data applications, and hence such studies cannot be generalized.

(iii) Spark MLlib and Flink-ML offer a variety of machine learning algorithms and utilities to exploit distributed and scalable big data applications. Spark MLlib outperforms Flink-ML in most of the machine learning use cases [42], except when repeated passes are performed on unchanged data. However, this performance evaluation may be further investigated,


as research studies such as [53] reported differently, with Flink outperforming Spark on sufficiently large cases.

(iv) For graph processing algorithms such as PageRank, Flink uses the Gelly library, which provides native closed-loop iteration operators, making it a suitable platform for large-scale graph analytics. Spark, on the other hand, uses the GraphX library, which has a much longer preprocessing time to build graphs and other data structures, making its performance worse as compared to Apache Flink. Apache Flink has been reported to obtain the best results for graph processing datasets (3x–5x in [54] and 2x–3x in [45]) as compared to Spark. However, some studies, such as [55], reported Spark to be 1.7x faster than Flink for large graph processing. Such inconsistent behavior may be further investigated in future research studies.

6.2. Future Work. In the area of big data, there is still a clear gap that requires more effort from the research community to build an in-depth understanding of the performance characteristics of big data resource management frameworks. We consider this study a step towards enlarging our knowledge of the big data world and an effort towards improving the state of the art and achieving the big vision in the big data domain. In earlier studies, a clear ranking could not be established, as the study parameters were mostly limited to a few issues such as throughput, latency, and machine learning. Furthermore, more investigation is required on resource engines such as Apache Samza in comparison with other frameworks. In addition, research effort needs to be carried out in several areas, such as data organization, platform-specific tools, and technological issues in the big data domain, in order to create next-generation big data infrastructures. The performance evaluation factors might also vary among systems depending on the algorithms used. As future work, we plan to benchmark these popular resource engines for meeting the resource demands and requirements of different scientific applications. Moreover, a scalability analysis could be done. In particular, the performance evaluation of adding dynamic worker nodes and the resulting performance analysis is of special interest. Additionally, further research can be carried out to evaluate performance aspects with respect to resource competition between jobs (on different resource schedulers such as YARN and Mesos) and the fluctuation of available computing resources. Finally, most of the experimentation in earlier studies was performed using standard parameter configurations; however, each resource management framework offers domain-specific tweaks and configuration optimization mechanisms for meeting application-specific requirements. The development of a benchmark suite that aims to find maximum throughput based on configuration optimization would be an interesting direction for future research.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] "Big Data in the Cloud: Converging Technologies (August 11)," https://www.intel.com/content/dam/www/public/emea/de/de/documents/product-briefs/big-data-cloud-technologies-brief.pdf.

[2] Q. He, J. Yan, R. Kowalczyk, H. Jin, and Y. Yang, "Lifetime service level agreement management with autonomous agents for services provision," Information Sciences, vol. 179, no. 15, pp. 2591–2605, 2009.

[3] O. Rana, M. Warnier, T. B. Quillinan, and F. Brazier, "Monitoring and reputation mechanisms for service level agreements," in 5th International Workshop on Grid Economics and Business Models, GECON '08, vol. 5206 of Lecture Notes in Computer Science, pp. 125–139, 2008.

[4] N. Yuhanna and M. Gualtieri, "The Forrester Wave: Big Data Hadoop Cloud Solutions, Q2 2016: Elasticity, Automation, and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud (June 20)," https://ncmedia.azureedge.net/ncmedia/2016/05/The Forrester Wave Big D.pdf.

[5] R. Arora, "An introduction to big data, high performance computing, high-throughput computing, and Hadoop," in Conquering Big Data with High Performance Computing, pp. 1–12, 2016.

[6] F. Pop, J. Kołodziej, and B. D. Martino, Resource Management for Big Data Platforms: Algorithms, Modelling, and High-Performance Computing Techniques, Springer, 2016.

[7] M. Gribaudo, M. Iacono, and F. Palmieri, "Performance modeling of big data-oriented architectures," in Resource Management for Big Data Platforms, Computer Communications and Networks, pp. 3–34, Springer International Publishing, Cham, 2016.

[8] G. Hesse and M. Lorenz, "Conceptual survey on data stream processing systems," in Proceedings of the 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), pp. 797–802, Melbourne, Australia, December 2015.

[9] D. Singh and C. K. Reddy, "A survey on platforms for big data analytics," Journal of Big Data, vol. 2, Article ID 8, 2014.

[10] S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin, "A survey of open source tools for machine learning with big data in the Hadoop ecosystem," Journal of Big Data, vol. 2, no. 1, Article ID 24, 2015.

[11] C. L. P. Chen and C. Y. Zhang, "Data-intensive applications, challenges, techniques and technologies: a survey on Big Data," Information Sciences, vol. 275, pp. 314–347, 2014.

[12] F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, and S. Sakr, "Big data 2.0 processing systems: taxonomy and open challenges," Journal of Grid Computing, vol. 14, no. 3, pp. 379–405, 2016.

[13] M. D. d. Assuncao, A. d. S. Veith, and R. Buyya, "Distributed data stream processing and edge computing: a survey on resource elasticity and future directions," Journal of Network and Computer Applications, vol. 103, pp. 1–17, 2018.

[14] "Big data working group: big data taxonomy, 2014."

[15] W. Inoubli, S. Aridhi, H. Mezni, M. Maddouri, and E. M. Nguifo, "An experimental survey on big data frameworks," Future Generation Computer Systems, 2018.

[16] H.-K. Kwon and K.-K. Seo, "A fuzzy AHP based multi-criteria decision-making model to select a cloud service," International Journal of Smart Home, vol. 8, no. 3, pp. 175–180, 2014.

[17] A. Rezgui and S. Rezgui, "A stochastic approach for virtual machine placement in volunteer cloud federations," in Proceedings of the 2nd IEEE International Conference on Cloud Engineering, IC2E 2014, pp. 277–282, Boston, MA, USA, March 2014.

[18] M. Salama, A. Zeid, A. Shawish, and X. Jiang, "A novel QoS-based framework for cloud computing service provider selection," International Journal of Cloud Applications and Computing, vol. 4, no. 2, pp. 48–72, 2014.

[19] G. Mateescu, W. Gentzsch, and C. J. Ribbens, "Hybrid computing — where HPC meets grid and cloud computing," Future Generation Computer Systems, vol. 27, no. 5, pp. 440–453, 2011.

[20] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: enabling scalable virtual organizations," International Journal of High Performance Computing Applications, vol. 15, no. 3, pp. 200–222, 2001.

[21] S. Benedict, "Performance issues and performance analysis tools for HPC cloud applications: a survey," Computing, vol. 95, no. 2, pp. 89–108, 2013.

[22] E. C. Inacio and M. A. R. Dantas, "A survey into performance and energy efficiency in HPC, cloud and big data environments," International Journal of Networking and Virtual Organisations, vol. 14, no. 4, pp. 299–318, 2014.

[23] N. Yuhanna and M. Gualtieri, "Elasticity, Automation, and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud, Q2 2016."

[24] S. Ullah, M. D. Awan, and M. S. Khiyal, "A price-performance analysis of EC2, Google Compute and Rackspace cloud providers for scientific computing," Journal of Mathematics and Computer Science, vol. 16, no. 2, pp. 178–192, 2016.

[25] J. Hurwitz, A. Nugent, F. Halper, and M. Kaufman, Big Data for Dummies, John Wiley & Sons, 2013.

[26] S. K. Garg, S. Versteeg, and R. Buyya, "A framework for ranking of cloud computing services," Future Generation Computer Systems, vol. 29, no. 4, pp. 1012–1023, 2013.

[27] D. Wu, S. Sakr, and L. Zhu, "Big data programming models," in Handbook of Big Data Technologies, pp. 31–63, 2017.

[28] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, vol. 1, no. 1, 2016.

[29] A. Li, X. Yang, S. Kandula, and M. Zhang, "CloudCmp: comparing public cloud providers," in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10), pp. 1–14, ACM, Melbourne, Australia, November 2010.

[30] J. Veiga, R. R. Exposito, X. C. Pardo, G. L. Taboada, and J. Touriño, "Performance evaluation of big data frameworks for large-scale data analytics," in Proceedings of the 4th IEEE International Conference on Big Data, Big Data 2016, pp. 424–431, December 2016.

[31] I. Mavridis and H. Karatza, "Log file analysis in cloud with Apache Hadoop and Apache Spark," in Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015), Poland, 2015.

[32] M. Zaharia, M. Chowdhury, and T. Das, "Fast and interactive analytics over Hadoop data with Spark," USENIX Login, vol. 37, no. 4, pp. 45–51, 2012.

[33] S. Vellaipandiyan and P. V. Raja, "Performance evaluation of distributed framework over YARN cluster manager," in Proceedings of the 2016 IEEE International Conference on Computational Intelligence and Computing Research, ICCIC 2016, December 2016.

[34] V. Taran, O. Alienin, S. Stirenko, Y. Gordienko, and A. Rojbi, "Performance evaluation of distributed computing environments with Hadoop and Spark frameworks," in Proceedings of the 2017 IEEE International Young Scientists' Forum on Applied Physics and Engineering (YSF), pp. 80–83, Lviv, Ukraine, October 2017.

[35] S. Gopalani and R. Arora, "Comparing Apache Spark and Map Reduce with performance analysis using K-means," International Journal of Computer Applications, vol. 113, no. 1, pp. 8–11, 2015.

[36] M. Bertoni, S. Ceri, A. Kaitoua, and P. Pinoli, "Evaluating cloud frameworks on genomic applications," in Proceedings of the 3rd IEEE International Conference on Big Data, IEEE Big Data 2015, pp. 193–202, November 2015.

[37] S. Hukerikar, R. A. Ashraf, and C. Engelmann, "Towards new metrics for high-performance computing resilience," in Proceedings of the 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017, pp. 23–30, June 2017.

[38] R. Lu, G. Wu, B. Xie, and J. Hu, "Stream bench: towards benchmarking modern distributed stream computing frameworks," in Proceedings of the 7th IEEE/ACM International Conference on Utility and Cloud Computing, UCC 2014, pp. 69–78, December 2014.

[39] L. Gu and H. Li, "Memory or time: performance evaluation for iterative operation on Hadoop and Spark," in Proceedings of the 15th IEEE International Conference on High Performance Computing and Communications, HPCC 2013, and 11th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, EUC 2013, pp. 721–727, November 2013.

[40] M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte, "A performance comparison of open-source stream processing platforms," in Proceedings of the 59th IEEE Global Communications Conference, GLOBECOM 2016, December 2016.

[41] J. L. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, vol. 31, no. 5, pp. 532–533, 1988.

[42] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, "A comparison on scalability for batch big data processing on Apache Spark and Apache Flink," Big Data Analytics, vol. 2, no. 1, 2017.

[43] P. Jakovits and S. N. Srirama, "Evaluating MapReduce frameworks for iterative scientific computing applications," in Proceedings of the 2014 International Conference on High Performance Computing and Simulation, HPCS 2014, pp. 226–233, July 2014.

[44] C. Boden, A. Spina, T. Rabl, and V. Markl, "Benchmarking data flow systems for scalable machine learning," in Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2017, May 2017.

[45] N. Spangenberg, M. Roth, and B. Franczyk, Evaluating New Approaches of Big Data Analytics Frameworks, vol. 208 of Lecture Notes in Business Information Processing, Springer, Cham, 2015.

[46] J. Shi, Y. Qiu, U. F. Minhas et al., "Clash of the titans: MapReduce vs. Spark for large scale data analytics," in Proceedings of the 41st International Conference on Very Large Data Bases, pp. 2110–2121, Kohala Coast, HI, USA, 2015.

[47] M. Kang and J. Lee, "A comparative analysis of iterative MapReduce systems," in Proceedings of the Sixth International Conference, pp. 61–64, Jeju, Republic of Korea, October 2016.

[48] H. Lee, M. Kang, S.-B. Youn, J.-G. Lee, and Y. Kwon, "An experimental comparison of iterative MapReduce frameworks," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, pp. 2089–2094, October 2016.

[49] S. Chintapalli, D. Dagit, B. Evans et al., "Benchmarking streaming computation engines: Storm, Flink and Spark streaming," in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016, pp. 1789–1792, May 2016.

[50] X. Zhang, C. Liu, S. Nepal, W. Dou, and J. Chen, "Privacy-preserving layer over MapReduce on cloud," in Proceedings of the 2nd International Conference on Cloud and Green Computing, CGC 2012, held jointly with the 2nd International Conference on Social Computing and Its Applications, SCA 2012, pp. 304–310, November 2012.

[51] L. Chung, B. A. Nixon, and E. Yu, "Using nonfunctional requirements to systematically select among alternatives in architectural design," in Proceedings of the 1st International Workshop on Architectures for Software Systems, pp. 31–43, 1995.

[52] S. Qian, G. Wu, J. Huang, and T. Das, "Benchmarking modern distributed streaming platforms," in Proceedings of the 2016 IEEE International Conference on Industrial Technology (ICIT), pp. 592–598, Taipei, Taiwan, March 2016.

[53] F. Dambreville and A. Toumi, "Generic and massively concurrent computation of belief combination rules," in Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, pp. 1–6, Blagoevgrad, Bulgaria, November 2016.

[54] R. W. Techentin, M. W. Markland, R. J. Poole, D. R. H. C. R. Haider, and B. K. Gilbert, "Page rank performance evaluation of cluster computing frameworks on Cray Urika-GX supercomputer," Computer Science and Engineering, vol. 6, pp. 33–38, 2016.

[55] O.-C. Marcu, A. Costan, G. Antoniu, and M. S. Pérez-Hernández, "Spark versus Flink: understanding performance in big data analytics frameworks," in Proceedings of the 2016 IEEE International Conference on Cluster Computing, CLUSTER 2016, pp. 433–442, Taipei, Taiwan, September 2016.


Figure 1: Classification of big data resource management frameworks (batch versus streaming compute infrastructures, with streaming further divided into micro-batch and continuous processing, covering Hadoop, Spark, Flink, Samza, and Storm).

(ii) Several vendors offer proprietary solutions for big data analysis, which could be potential candidates for the comparative analysis conducted in this study. However, these frameworks were not selected, for two reasons. Firstly, most of these solutions are extensions of open-source solutions and hence exhibit identical performance results in most of the cases. Secondly, for empirical studies, researchers mostly prefer open-source solutions, as the documentation, usage scenarios, source code, and other relevant details are freely available. Hence we selected open-source solutions for the performance evaluation.

(iii) We did not include frameworks which are now deprecated or discontinued, such as Apache S4, in favor of other resource management systems.

This paper is organized as follows. Section 2 reviews the popular resource management frameworks. The comparison of big data frameworks is presented in Section 3. Based on the comparative evaluation, we categorize these systems in Section 4. Related work is presented in Section 5, and finally we present the conclusion and possible future directions in Section 6.

2. Big Data Resource Management Frameworks

Big data is offering new emerging trends and opportunities to unearth operational insight towards data management. The most challenging issue for organizations is often that the amount of data is massive, and it needs to be processed at an optimal speed to synthesize relevant results. Analyzing such a huge amount of data from multiple sources can help organizations plan for the future and anticipate changing market trends and customer requirements. In many cases, big data is analyzed in batch mode. However, in many situations we may need to react to the current state of the data or analyze data that is in motion (data that is constantly coming in and needs to be processed immediately). These applications require a continuous stream of often unstructured data to be processed. Therefore, data is continuously analyzed and cached in memory before it is stored on secondary storage devices. Processing streams of data works by filtering in-memory tables of data across a cluster of servers. Any delay in the data analysis can seriously impact customer satisfaction or may result in project failure [25].

While the Hadoop framework is a popular platform for processing huge datasets in parallel batch mode using commodity computational resources, there are a number of other computing infrastructures that can be used in various application domains. The primary focus of this study is to investigate popular big data resource management frameworks which are commonly used in the cloud computing environment. Most of the popular big data tools available for the cloud computing platform, including the Hadoop ecosystem, are available under open-source licenses. One of the key appeals of Hadoop and other open-source solutions is their low total cost of ownership. While proprietary solutions have expensive license fees and may require more costly specialized hardware, these open-source solutions have no licensing fees and can run on industry-standard hardware [14]. Figure 1 demonstrates the classification of the various processing architectures of open-source big data resource management frameworks.

In the subsequent sections, we discuss various open-source big data resource management frameworks that are


widely used in conjunction with the cloud computing environment.

2.1. Hadoop. Hadoop [26] is a distributed programming and storage infrastructure based on an open-source implementation of the MapReduce model [27]. MapReduce is the first and current de facto programming environment for developing data-centric parallel applications for parsing and processing large datasets. MapReduce is inspired by the Map and Reduce primitives used in functional programming. In MapReduce programming, users only have to write the logic of the Mapper and Reducer, while the process of shuffling, partitioning, and sorting is automatically handled by the execution engine [14, 27, 28]. The data can either be saved in the Hadoop file system as unstructured data or in a database as structured data [14]. The Hadoop Distributed File System (HDFS) is responsible for breaking large data files into smaller pieces known as blocks. The blocks are placed on different data nodes, and it is the job of the NameNode to record which blocks on which data nodes make up the complete file. The NameNode also works as a traffic cop, handling all access to the files, including reads, writes, creates, deletes, and replication of data blocks on the data nodes. A pipeline is a link between multiple data nodes that exists to handle the transfer of data across the servers. A user application pushes a block to the first data node in the pipeline. The data node takes over and forwards the block to the next node in the pipeline; this continues until all the data, and all the data replicas, are saved to disk. Afterwards, the client repeats the process by writing the next block of the file [25].
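The division of labor just described (user-supplied Mapper and Reducer logic, framework-managed shuffle and sort) can be sketched with a single-process word count in plain Python; this models the programming contract only, not Hadoop's distributed execution:

```python
from itertools import groupby
from operator import itemgetter

def mapper(line):
    # User logic: emit a (word, 1) pair for every word in an input split.
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    # User logic: aggregate all values seen for one key.
    return (word, sum(counts))

def map_reduce(lines):
    # Framework logic: run the mappers, then shuffle/sort by key, then reduce.
    intermediate = [pair for line in lines for pair in mapper(line)]
    intermediate.sort(key=itemgetter(0))  # the "shuffle and sort" phase
    return [reducer(key, (v for _, v in group))
            for key, group in groupby(intermediate, key=itemgetter(0))]

print(map_reduce(["big data big insight", "data in motion"]))
# [('big', 2), ('data', 2), ('in', 1), ('insight', 1), ('motion', 1)]
```

The user code never touches partitioning or sorting; as in the text, everything between the mapper's output and the reducer's input is the execution engine's responsibility.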

The two major components of Hadoop MapReduce are job scheduling and tracking. The early versions of Hadoop supported a limited job and task tracking system. In particular, the earlier scheduler could not manage non-MapReduce tasks, and it was not capable of optimizing cluster utilization. A new capability was therefore needed to address these shortcomings and offer more flexibility, scaling, efficiency, and performance. Because of these issues, Hadoop 2.0 was introduced. Alongside the earlier HDFS resource management and the MapReduce model, it introduced a new resource management layer called Yet Another Resource Negotiator (YARN) that takes care of better resource utilization [25].

YARN is the core Hadoop service providing two major functionalities: global resource management (ResourceManager) and per-application management (ApplicationMaster). The ResourceManager is a master service which controls the NodeManager in each of the nodes of a Hadoop cluster. It includes a scheduler whose main task is to allocate system resources to specific running applications. All the required system information is tracked by a Resource Container, which monitors CPU, storage, network, and other important resource attributes necessary for executing applications in the cluster. The ResourceManager has a slave NodeManager service to monitor application usage statistics. Each deployed application is handled by a corresponding ApplicationMaster service. If more resources are required to support a running application, the ApplicationMaster requests the NodeManager, and the NodeManager negotiates with the ResourceManager (scheduler) for the additional capacity on behalf of the application [26].
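As a rough, purely illustrative model of this negotiation, the class names below echo YARN's roles, but the allocation logic is a toy capacity counter, not YARN's actual scheduler:

```python
class ResourceManager:
    """Toy global scheduler: grants container requests while capacity remains."""
    def __init__(self, total_containers):
        self.available = total_containers

    def allocate(self, app_id, requested):
        granted = min(requested, self.available)
        self.available -= granted
        return granted  # may be less than requested under contention

class ApplicationMaster:
    """Toy per-application manager: asks the RM for containers as needed."""
    def __init__(self, app_id, rm):
        self.app_id, self.rm, self.containers = app_id, rm, 0

    def request_containers(self, n):
        self.containers += self.rm.allocate(self.app_id, n)

rm = ResourceManager(total_containers=10)
am1 = ApplicationMaster("app-1", rm)
am2 = ApplicationMaster("app-2", rm)
am1.request_containers(7)
am2.request_containers(5)   # only 3 remain, so the grant is partial
print(am1.containers, am2.containers, rm.available)  # 7 3 0
```

The point of the sketch is the separation of concerns: the per-application masters only express demand, while a single global scheduler decides what is actually granted.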

2.2. Spark. Apache Spark [29], originally developed as Berkeley Spark, was proposed as an alternative to Hadoop. It can perform faster parallel computing operations by using in-memory primitives. A job can load data in either local memory or a cluster-wide shared memory and query it iteratively with much greater speed as compared to disk-based systems such as Hadoop MapReduce [27]. Spark has been developed for two applications where keeping data in memory may significantly improve performance: iterative machine learning algorithms and interactive data mining. Spark is also intended to unify the current processing stack, where batch processing is performed using MapReduce, interactive queries are performed using HBase, and the processing of streams for real-time analytics is performed using other frameworks such as Twitter's Storm. Spark offers programmers a functional programming paradigm with data-centric programming interfaces built on top of a new data model called the Resilient Distributed Dataset (RDD), which is a collection of objects spread across a cluster and stored in memory or on disk [28]. Applications in Spark can load these RDDs into the memory of a cluster of nodes and let the Spark engine automatically manage the partitioning of the data and its locality during runtime. This versatile iterative model makes it possible to control the persistence and manage the partitioning of data. A stream of incoming data can be partitioned into a series of batches and processed as a sequence of small-batch jobs. The Spark framework allows this seamless combination of streaming and batch processing in a unified system. To provide rapid application development, Spark provides clean, concise APIs in Scala, Java, and Python. Spark can also be used interactively from the Scala and Python shells to rapidly query big datasets.
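The RDD idea, recomputing a dataset from its lineage of transformations unless it has been explicitly cached in memory, can be mimicked in a few lines of plain Python; this is a deliberately simplified mock for illustration, not Spark's API:

```python
class MiniRDD:
    """Toy RDD: remembers its lineage (a source plus transformations) and
    recomputes on every action unless cache() has pinned the result."""
    def __init__(self, source, transforms=()):
        self.source, self.transforms = source, transforms
        self.cached = None
        self.computations = 0  # instrumentation: how often we recomputed

    def map(self, fn):
        # Transformations are lazy: they only extend the lineage.
        return MiniRDD(self.source, self.transforms + (fn,))

    def cache(self):
        self.cached = self._compute()
        return self

    def _compute(self):
        self.computations += 1
        data = list(self.source)
        for fn in self.transforms:
            data = [fn(x) for x in data]
        return data

    def collect(self):
        # An action: serve from the in-memory copy if one exists.
        return self.cached if self.cached is not None else self._compute()

rdd = MiniRDD(range(5)).map(lambda x: x * x)
rdd.collect(); rdd.collect()
print(rdd.computations)     # 2: recomputed from lineage on each action
cached = MiniRDD(range(5)).map(lambda x: x * x).cache()
cached.collect(); cached.collect()
print(cached.computations)  # 1: repeated passes hit the in-memory copy
```

This is exactly why iterative machine learning workloads, which make repeated passes over the same data, benefit most from Spark's in-memory model.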

Spark is also the engine behind Shark, a complete Apache Hive-compatible data warehousing system that can run much faster than Hive. Spark also supports data access from Hadoop. Spark fits seamlessly into the Hadoop 2.0 ecosystem (Figure 2) as an alternative to MapReduce, while using the same underlying infrastructure such as YARN and HDFS. Spark is also an integral part of the SMACK stack, which provides popular cloud-native PaaS capabilities such as IoT, predictive analytics, and real-time personalization for big data. In SMACK, the Apache Mesos cluster manager (instead of YARN) is used for dynamic allocation of cluster resources, not only for running Hadoop applications but also for handling heterogeneous workloads.

The GraphX and MLlib libraries include state-of-the-art graph and machine learning algorithms that can be executed in real time. BlinkDB is a novel parallel, sampling-based approximate query engine for running interactive SQL queries that trades off query accuracy for response time, with results annotated by meaningful error bars. BlinkDB has been shown to run 200 times faster than Hive within an error rate of 2-10%. Moreover, Spark provides an interactive tool called Spark Shell, which allows exploiting the Spark cluster in real time. Once interactive applications are created, they may subsequently be executed interactively in the cluster.

Scientific Programming 5

[Figure 2: Hadoop 2.0 ecosystem (source: [14]). The diagram stacks Hardware, Hypervisor, OS, and JVM beneath HDFS (distributed file system) and YARN (cluster resource management), with MapReduce (distributed data processing) and TEZ (execution engine) above, alongside HBase (column-oriented NoSQL database), Pig (data-flow language), Hive (SQL queries), Lucene (search), Oozie (workflow scheduling), and Zookeeper (coordination).]

[Figure 3: Spark architecture (source: [15]). A driver program (Spark Context) connects through a cluster manager (Spark Standalone, YARN, or Mesos) to worker nodes, with storage provided by HDFS or a DBMS, and the MLlib, SparkSQL, Spark Streaming, and GraphX libraries on top.]

In Figure 3, we present the general Spark system architecture.

2.3. Flink. Apache Flink is an emerging competitor of Spark which offers functional programming interfaces very similar to Spark's. It shares many programming primitives and transformations in the same way as Spark does for iterative development, predictive analysis, and graph stream processing. Flink was developed to fill the gap left by Spark, which uses minibatch stream processing instead of a pure streaming approach. Flink ensures high processing performance when dealing with complex big data structures such as graphs. Flink programs are regular applications which are written with a rich set of transformation operations (such as mapping, filtering, grouping, aggregating, and joining) applied to the input datasets. The Flink dataset uses a table-based model; therefore application developers can use index numbers to specify a particular field of a dataset [27, 28].

Flink is able to achieve high throughput and low latency, thereby processing a bundle of data very quickly. Flink is designed to run on large-scale clusters with thousands of nodes, and in addition to a standalone cluster mode, Flink provides support for YARN. For distributed environments, Flink chains operator subtasks together into tasks, and each task is executed by one thread [16]. The Flink runtime consists of two types of processes. There is at least one JobManager (also called a master), which coordinates the distributed execution: it schedules tasks, coordinates checkpoints, and coordinates recovery on failures. A high-availability setup may involve multiple JobManagers, one of which is always the leader while the others are on standby. The TaskManagers (also called workers) execute the tasks (or, more specifically, the subtasks) of a dataflow and buffer and exchange the data streams. There must always be at least one TaskManager. The JobManagers and TaskManagers can be started in various ways: directly on the machines as a standalone cluster, in containers, or managed by resource frameworks like YARN or Mesos. TaskManagers connect to JobManagers, announcing themselves as available, and are assigned work. Figure 4 demonstrates the main components of the Flink framework.

2.4. Storm. Storm [17] is a free, open-source distributed stream processing computation framework. It takes several characteristics from the popular actor model and can be used with practically any programming language for developing applications such as real-time streaming analytics, critical workflow systems, and data delivery services. The engine may process billions of tuples each day in a fault-tolerant way. It can be integrated with popular resource management frameworks such as YARN, Mesos, and Docker. An Apache Storm cluster is made up of two types of processing actors: spouts and bolts.

(i) Spout is connected to the external data source of a stream and continuously emits or collects new data for further processing.

(ii) Bolt is a processing logic unit within a stream processing topology; each bolt is responsible for a certain processing task such as transformation, filtering, aggregating, and partitioning.

[Figure 4: Architectural components of Flink (source: [16]). A client builds an optimized dataflow graph from the Flink program and submits the job to the JobManager (master/YARN application master), which schedules tasks, triggers checkpoints, and receives task status, heartbeats, and statistics; TaskManagers (workers) execute tasks in task slots and exchange data streams through their network and memory/IO managers.]

Storm defines a workflow as a directed acyclic graph (DAG), called a topology, with connected spouts and bolts as the vertices. Edges in the graph define the links between the bolts and the data stream. Unlike batch jobs, which are executed only once, Storm jobs run forever until they are killed. There are two types of nodes in a Storm cluster: nimbus (master node) and supervisor (worker node). Nimbus, similar to the Hadoop JobTracker, is the core component of Apache Storm and is responsible for distributing load across the cluster, queuing and assigning tasks to different processing units, and monitoring execution status. Each worker node executes a process known as the supervisor, which may have one or more worker processes. The supervisor delegates tasks to worker processes, and each worker process then creates a subset of the topology to run the task. Apache Storm relies on an internal distributed messaging system called Netty for interprocess communication, while Zookeeper manages the communication between the real-time job tracker (nimbus) and the supervisors (Storm workers). Figure 5 outlines the high-level view of a Storm cluster.
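The spout/bolt pipeline described above can be sketched as a toy topology in plain Python (this is not the real Storm API; generators stand in for unbounded streams, and the function names are invented for illustration):

```python
# Toy sketch of a Storm-style topology: spout -> split bolt -> count bolt.
def word_spout():
    # Spout: emits tuples from an external source; a finite generator
    # stands in here for what would normally be an unbounded stream.
    for line in ["to be or not to be", "to stream or to batch"]:
        yield line

def split_bolt(stream):
    # Bolt: stateless transformation, splits each incoming tuple into words.
    for line in stream:
        for word in line.split():
            yield word

def count_bolt(stream):
    # Bolt: stateful aggregation, maintains a running count per word.
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wiring the vertices together forms the (acyclic) topology graph.
counts = count_bolt(split_bolt(word_spout()))
```

In real Storm, each bolt instance would run in its own executor on a worker node and tuples would flow over the network, but the dataflow shape, a DAG of spouts feeding bolts, is the same.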

2.5. Apache Samza. Apache Samza [18] is a distributed stream processing framework mainly written in Scala and Java. Overall, it has a relatively high throughput as well as somewhat increased latency when compared to Storm [8]. It uses Apache Kafka, which was originally developed at LinkedIn, for messaging and streaming, while Apache Hadoop YARN or Mesos is utilized as an execution platform for overall resource management. Samza relies on Kafka's semantics to define the way streams are handled. Its main objective is to collect and deliver massively large volumes of event data, in particular log data, with a low latency. A Kafka system's architecture is comparatively simple as it only consists of a set of brokers, which are individual nodes that make up a Kafka cluster. Data streams are defined by topics; a topic is a stream of related information that consumers can subscribe to. Topics are divided into partitions that are distributed over the broker instances, and the corresponding messages are retrieved using a pull mechanism. The basic flow of job execution is presented in Figure 6.
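The topic/partition model that Samza inherits from Kafka can be illustrated with a small sketch in plain Python (not the real Kafka API; the toy hash-based partitioner and all names are assumptions for illustration). The key property it demonstrates is that messages with the same key always land in the same partition, so a consumer of that partition sees them in production order:

```python
# Toy sketch of Kafka-style topic partitioning as used by Samza.
NUM_PARTITIONS = 3

def partition_for(key, num_partitions=NUM_PARTITIONS):
    # Kafka's default partitioner hashes the message key; we use a
    # deterministic toy hash so the example is reproducible.
    return sum(key.encode()) % num_partitions

# Broker-side state for one topic: a list of (key, value) records per partition.
topic = {p: [] for p in range(NUM_PARTITIONS)}

def produce(key, value):
    topic[partition_for(key)].append((key, value))

for i, user in enumerate(["alice", "bob", "alice", "carol", "alice"]):
    produce(user, f"event-{i}")

# All of alice's events sit in a single partition, in production order,
# which is what lets a Samza task process one user's stream sequentially.
p = partition_for("alice")
alice_events = [v for k, v in topic[p] if k == "alice"]
```

This per-key ordering within a partition is why Samza can keep per-key state in a single task without cross-node coordination.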

Tables 1 and 2 present a brief comparative analysis of these frameworks based on some common attributes. As shown in the tables, the MapReduce computation data flow follows a chain of stages with no loop. At each stage, the program


Table 1: Comparison of big data frameworks.

Current stable version: Hadoop 2.8.1; Spark 2.2.0; Storm 1.1.1; Samza 0.13.0; Flink 1.3.2.
Batch processing: Hadoop yes; Spark yes; Storm yes; Samza no; Flink yes.
Computational model: Hadoop MapReduce; Spark streaming (microbatches); Storm streaming (microbatches); Samza streaming; Flink supports continuous flow streaming, microbatch, and batch.
Data flow: Hadoop chain of stages; Spark directed acyclic graph; Storm directed acyclic graphs (DAGs) with spouts and bolts; Samza streams (acyclic graph); Flink controlled cyclic dependency graph through machine learning.
Resource management: Hadoop YARN; Spark YARN, Mesos; Storm HDFS (YARN), Mesos; Samza YARN, Mesos; Flink Zookeeper, YARN, Mesos.
Language support: Hadoop all major languages; Spark Java, Scala, Python, and R; Storm any programming language; Samza JVM languages; Flink Java, Scala, Python, and R.
Job management optimization: Hadoop MapReduce approach; Spark Catalyst extension; Storm Storm-YARN and 3rd-party tools like Ganglia; Samza internal JobRunner; Flink internal optimizer.
Interactive mode: Hadoop none (3rd-party tools like Impala can be integrated); Spark interactive shell; Storm none; Samza limited API of Kafka streams; Flink Scala shell.
Machine learning libraries: Hadoop Apache Mahout, H2O; Spark SparkML and MLlib; Storm Trident-ML, Apache SAMOA; Samza Apache SAMOA; Flink Flink-ML.
Maximum reported nodes (scalability): Hadoop 42,000 (Yahoo Hadoop cluster); Spark 8,000; Storm 300; Samza around a hundred-node clusters at LinkedIn; Flink thousands of nodes in a customized Alibaba Flink cluster.


Table 2: Comparative analysis of big data resource frameworks (star rating, s = 5).

Processing speed: Hadoop ★★★; Spark ★★★★; Flink ★★★★★; Storm ★★★★; Samza ★★★★.
Fault tolerance: Hadoop ★★★★; Spark ★★; Flink ★★★★; Storm ★★★; Samza ★★★★.
Scalability: Hadoop ★★★★★; Spark ★★★★; Flink ★★★; Storm ★★★; Samza ★★★.
Machine learning: Hadoop ★★; Spark ★★★★★; Flink ★★★★; Storm ★★★; Samza ★★★★.
Low latency: Hadoop ★★; Spark ★★★; Flink ★★★; Storm ★★★★; Samza ★★★★.
Security: Hadoop ★★★★; Spark ★★★★★; Flink ★★★★; Storm ★★★★; Samza N/A.
Dataset size: Hadoop ★★★★★; Spark ★★★; Flink ★★★★; Storm ★★★; Samza ★★★.

[Figure 5: Architecture of a Storm cluster (source: [17]). A nimbus master node coordinates, through the Zookeeper framework, supervisor-managed worker nodes, whose worker processes run executors and tasks.]

[Figure 6: Samza architecture (source: [18]). Kafka topics feed a Samza job whose containers, each running tasks, are hosted on machines managed by the YARN NodeManager.]

proceeds with the output from the previous stage and generates an input for the next stage. Although machine learning algorithms are mostly designed in the form of cyclic data flows, Spark, Storm, and Samza represent them as directed acyclic graphs to optimize the execution plan, whereas Flink supports a controlled cyclic dependency graph at runtime to represent machine learning algorithms in a very efficient way.

Hadoop and Storm do not provide any default interactive environment. Apache Spark has a command-line interactive shell to use the application features, and Flink provides a Scala shell to configure standalone as well as cluster setups. Apache Hadoop is highly scalable, and it has been used in Yahoo production clusters consisting of 42,000 nodes across 20 YARN clusters. The largest known cluster size for Spark is 8,000 computing nodes, while Storm has been tested on a maximum of 300-node clusters. An Apache Samza cluster with around a hundred nodes has been used in the LinkedIn data flow and application messaging system. Apache Flink has been customized for the Alibaba search engine with a deployment capacity of thousands of processing nodes.

3. Comparative Evaluation of Big Data Frameworks

Big data in cloud computing, a popular research trend, is exerting significant influence on current enterprises, IT industries, and research communities. There are a number of disruptive and transformative big data technologies and solutions that are rapidly emanating and evolving in order to provide data-driven insight and innovation. Furthermore, modern cloud computing services offer all kinds of big data analytic tools, technologies, and computing infrastructures to speed up the data analysis process at an affordable cost. Although many distributed resource management frameworks are available nowadays, the main issue is how to


select a suitable big data framework. The selection of one big data platform over the others will come down to the specific application requirements and constraints, which may involve several tradeoffs and application usage scenarios. However, we can identify some key factors that need to be fulfilled before deploying a big data application in the cloud. In this section, based on empirical evidence from the available literature, we discuss the advantages and disadvantages of each resource management framework.

3.1. Processing Speed. Processing speed is an important performance measurement that may be used to evaluate the effectiveness of different resource management frameworks. It is a common metric for the maximum number of I/O operations to disk or memory, or the data transfer rate between the computational units of the cluster, over a specific amount of time. In the context of big data, the average processing speed, represented as m and calculated after n iteration runs, is the maximum number of memory/disk-intensive operations that can be performed over a time interval t_i:

m = \frac{\sum_{i=1}^{n} m_i}{\sum_{i=1}^{n} t_i}.  (1)
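As a numeric sketch of Equation (1), the average processing speed is simply the total operations completed across all runs divided by the total elapsed time; the values below are assumed for illustration:

```python
# Equation (1): m = (sum of m_i) / (sum of t_i) over n iteration runs.
ops_per_run  = [1200, 1500, 1350]   # m_i: memory/disk-intensive ops in run i (assumed)
time_per_run = [2.0, 2.5, 2.1]      # t_i: seconds taken by run i (assumed)

m = sum(ops_per_run) / sum(time_per_run)   # average operations per second
```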

Veiga et al. [30] conducted a series of experiments on a multicore cluster setup to demonstrate the performance of Apache Hadoop, Spark, and Flink. Apache Spark and Flink proved to be much more efficient execution platforms than Hadoop when performing nonsort benchmarks. It was further noted that Spark showed better performance for operations such as WordCount and K-Means (CPU-bound in nature), while Flink achieved better results for the PageRank algorithm (memory-bound in nature). Mavridis and Karatza [31] experimentally compared performance statistics of Apache Hadoop and Spark on the Okeanos IaaS cloud platform. For each set of experiments, the necessary statistics related to execution time, working nodes, and dataset size were recorded. Spark performance was found to be optimal compared to Hadoop for most of the cases. Furthermore, Spark on the YARN platform showed suboptimal results as compared to the case when it was executed in standalone mode. Similar results were also observed by Zaharia et al. [32] on a 100 GB dataset record. Vellaipandiyan and Raja [33] demonstrated performance evaluation and comparison of the Hadoop and Spark frameworks on residents' record datasets ranging from 100 GB to 900 GB in size. Spark's scale of performance was relatively better when the dataset was of small to medium size (100 GB-750 GB); afterwards, its performance declined as compared to Hadoop. The primary reason for the performance decline was that the Spark cache could not fit the larger dataset into memory. Taran et al. [34] quantified performance differences of Hadoop and Spark using a WordCount dataset ranging from 100 KB to 1 GB. It was observed that the Hadoop framework was five times faster than Spark when the evaluation was performed using a larger set of data sources, whereas for the smaller tasks Spark showed better performance. However, the speed-up ratio decreased for both frameworks with the growth of the input dataset.

Gopalani and Arora [35] used the K-Means algorithm on several small, medium, and large location datasets to compare the Hadoop and Spark frameworks. The study results showed that Spark performed up to three times better than MapReduce for most of the cases. Bertoni et al. [36] performed an experimental evaluation of Apache Flink and Storm using large genomic datasets on the Amazon EC2 cloud. Apache Flink was superior to Storm for histogram and map operations, while Storm outperformed Flink when a genomic join application was deployed.

3.2. Fault Tolerance. Fault tolerance is the characteristic that enables a system to continue functioning in case of the failure of one or more components. High-performance computing applications involve hundreds of nodes that are interconnected to perform a specific task; the failure of a node should have zero or minimal effect on the overall computation. The tolerance of a system, represented as Tol_FT, to meet its requirements after a disruption is the ratio of the time to complete tasks without observing any fault events to the overall execution time in which some fault events were detected and the system state was reverted back to a consistent state:

Tol_{FT} = \frac{T_x}{T_x + \sigma^2},  (2)

where T_x is the estimated correct execution time, obtained from a program run that is presumed to be fault-free or by averaging the execution time from several application runs that produce a known correct output, and \sigma^2 represents the variance in a program's execution time due to the occurrence of fault events. For an HPC application that consists of a set of computationally intensive tasks \Gamma = {\tau_1, \tau_2, ..., \tau_n}, and since Tol_FT for each individual task is computed as Tol_{\tau_i}, the overall application resilience Tol may be calculated as [37]

Tol = \sqrt[n]{Tol_{\tau_1} \cdot Tol_{\tau_2} \cdots Tol_{\tau_n}}.  (3)
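Equations (2) and (3) can be worked through numerically: each task's tolerance is its fault-free time divided by that time plus the fault-induced variance, and the overall resilience is the geometric mean of the per-task tolerances. The execution times and variances below are assumed values for illustration:

```python
import math

def tol_ft(t_x, sigma_sq):
    # Equation (2): per-task tolerance, in (0, 1]; closer to 1 means
    # fault events barely perturb the execution time.
    return t_x / (t_x + sigma_sq)

# Three tasks with assumed fault-free times T_x and variances sigma^2.
task_tols = [tol_ft(100.0, 4.0), tol_ft(80.0, 9.0), tol_ft(120.0, 1.0)]

# Equation (3): overall resilience is the n-th root of the product,
# i.e. the geometric mean of the per-task tolerances.
overall = math.prod(task_tols) ** (1.0 / len(task_tols))
```

The geometric mean penalizes a single fragile task more than an arithmetic mean would, which matches the intuition that one unreliable stage degrades the whole pipeline.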

Lu et al. [38] used the StreamBench toolkit to evaluate the performance and fault tolerance ability of Apache Spark, Storm, and Samza. It was found that, with increased data sizes, Spark is much more stable and fault-tolerant than Storm but may be less efficient when compared to Samza. Furthermore, when compared in terms of handling a large capacity of data, both Samza and Spark outperformed Storm. Gu and Li [39] used the PageRank algorithm to perform a comparative experiment on the Hadoop and Spark frameworks. It was observed that for smaller datasets, such as wiki-Vote and soc-Slashdot0902, Spark outperformed Hadoop by a considerable margin. However, this speed-up degraded with the growth of the dataset, and for large datasets Hadoop easily outperformed Spark. Furthermore, for massively large datasets, Spark was reported to crash with a JVM heap exception while Hadoop still completed its task. Lopez et al. [40] evaluated the throughput and fault tolerance mechanisms of Apache Storm and Flink. The experiments were based on a threat detection system, where Apache Storm demonstrated better throughput as compared to Flink. For fault tolerance,


different virtual machines were manually turned off to analyze the impact of node failures. Apache Flink used its internal subsystem to detect and migrate the failed tasks to other machines, and hence resulted in very few message losses. On the other hand, Storm took more time, as Zookeeper, involving some performance overhead, was responsible for reporting the state of nimbus and thereafter processing the failed task on other nodes.

3.3. Scalability. Scalability refers to the ability to accommodate large loads or changes in size/workload by provisioning resources at runtime. This can be further categorized as scale-up (making the hardware stronger) or scale-out (adding additional nodes). One of the critical requirements of enterprises is to process large volumes of data in a timely manner to address high-value business problems. Dynamic resource scalability allows business entities to perform massive computation in parallel, thus reducing overall time complexity and effort. The definition of scalability comes from Amdahl's and Gustafson's laws [41]. Let W be the size of the workload before the improvement of the system resources, let \alpha be the fraction of the execution workload that benefits from the improvement of system resources, and let 1 - \alpha be the fraction concerning the part that does not benefit from it. When using an n-processor system, the user workload is scaled to

W' = \alpha W + (1 - \alpha) n W.  (4)

The parallel execution time of a scaled workload on n processors is defined as the scaled-workload speed-up S, as shown in

S = \frac{W'}{W} = \frac{\alpha W + (1 - \alpha) n W}{W}.  (5)
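A numeric sketch of Equations (4) and (5) follows, with assumed values for the workload, the fraction \alpha, and the processor count; note that, per Equation (4) as printed, it is the (1 - \alpha) share of the workload that scales with n:

```python
# Equations (4)-(5): scaled workload W' = aW + (1 - a)nW, speed-up S = W'/W.
W = 1.0       # normalized workload size (assumed)
alpha = 0.1   # workload fraction that stays fixed; (1 - alpha) scales with n (assumed)
n = 16        # number of processors (assumed)

W_scaled = alpha * W + (1 - alpha) * n * W
S = W_scaled / W    # simplifies to alpha + (1 - alpha) * n
```

With these numbers S = 0.1 + 0.9 * 16 = 14.5, i.e. near-linear speed-up when only a small fixed fraction does not scale.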

García-Gil et al. [42] performed a scalability and performance comparison of Apache Spark and Flink using a feature selection framework that assembles multiple information-theoretic criteria into a single greedy algorithm. The ECBDL14 dataset was used to measure the scalability factor for the frameworks. It was observed that Spark's scalability performance was 4-10 times faster than Flink's. Jakovits and Srirama [43] analyzed four MapReduce-based frameworks, including Hadoop and Spark, for benchmarking the Partitioning Around Medoids, Clustering Large Applications, and Conjugate Gradient linear system solver algorithms using MPI. All experiments were performed on a cloud of 33 Amazon EC2 large instances. For all algorithms, Spark performed much better than Hadoop in terms of both performance and scalability. Boden et al. [44] aimed to investigate scalability with respect to both data size and dimensionality in order to demonstrate a fair and insightful benchmark that reflects the requirements of real-world machine learning applications. The benchmark comprised distributed optimization algorithms for supervised learning as well as algorithms for unsupervised learning. For supervised learning, they implemented machine learning algorithms using the Breeze library, while for unsupervised learning they chose the popular k-Means clustering algorithm in order to assess the scalability of the two resource management frameworks. The overall execution time of Flink was relatively low in resource-constrained settings with a limited number of nodes, while Spark had a clear edge once enough main memory was available due to the addition of new computing nodes.

3.4. Machine Learning and Iterative Tasks Support. Big data applications are inherently complex in nature and usually involve tasks and algorithms that are iterative in nature. These applications have a distinctly cyclic character: they achieve the desired result by continually repeating a set of tasks until the results can no longer be substantially improved.

Spangenberg et al. [45] used real-world datasets with four algorithms, that is, WordCount, K-Means, PageRank, and relational query, to benchmark Apache Flink and Storm. It was observed that Apache Storm performs better in batch mode as compared to Flink. However, with increasing complexity, Apache Flink had a performance advantage over Storm, and thus it was better suited for iterative data or graph processing. Shi et al. [46] focused on analyzing Apache MapReduce and Spark for batch and iterative jobs. For smaller datasets, Spark proved to be the better choice, but when experiments were performed on larger datasets, MapReduce turned out to be several times faster than Spark. For iterative operations such as K-Means, Spark turned out to be 1.5 times faster than MapReduce in its first iteration, while Spark was more than 5 times faster in subsequent operations.

Kang and Lee [47] examined five resource management frameworks, including Apache Hadoop and Spark, with respect to the performance overheads (disk input/output, network communication, scheduling, etc.) involved in supporting iterative computation. The PageRank algorithm was used to evaluate these performance issues. Since static data processing tends to be a more frequent operation than dynamic data processing, as it is used in every iteration of MapReduce, it may cause significant performance overhead in the case of MapReduce. On the other hand, Apache Spark uses read-only cached versions of objects (resilient distributed datasets) which can be reused in parallel operations, thus reducing the performance overhead during iterative computation. Lee et al. [48] evaluated five systems, including Hadoop and Spark, over various workloads to compare four iterative algorithms. The experimentation was performed on the Amazon EC2 cloud. Overall, Spark showed the best performance when iterative operations were performed in main memory. In contrast, the performance of Hadoop was significantly poorer as compared to the other resource management systems.

3.5. Latency. Big data and low latency are strongly linked. Big data applications provide true value to businesses, but they are mostly time critical. If cloud computing is to be the successful platform for big data implementation, one of the key requirements will be the provisioning of high-speed networks to reduce communication latency. Furthermore, big data frameworks usually involve a centralized design where the scheduler assigns all tasks through a single node, which may significantly impact the latency when the size of the data is huge.


Let T_elapsed be the elapsed time between the start and finish times of a program in a distributed architecture, let T_i be the effective execution time, and let \lambda_i be the sum of the total idle units of the i-th processor from a set of N processors. Then the average latency, represented as \lambda(W, N) for a workload of size W, is defined as the average amount of overhead time needed for each processor to complete the task:

\lambda(W, N) = \frac{\sum_{i=1}^{N} (T_{elapsed} - T_i + \lambda_i)}{N}.  (6)
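Equation (6) can be sketched numerically: for each processor, the overhead is the gap between wall-clock time and its effective execution time, plus its idle time, averaged over all N processors. The timing values below are assumed for illustration:

```python
# Equation (6): average latency over N processors.
T_elapsed = 10.0              # wall-clock time of the whole job (assumed)
T_i    = [9.0, 8.5, 9.5]      # effective execution time per processor (assumed)
idle_i = [0.5, 1.0, 0.2]      # total idle units lambda_i per processor (assumed)

N = len(T_i)
avg_latency = sum(T_elapsed - t + l for t, l in zip(T_i, idle_i)) / N
```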

Chintapalli et al. [49] conducted a detailed analysis of the Apache Storm, Flink, and Spark streaming engines for latency and throughput. The study results indicated that at high throughput, Flink and Storm have significantly lower latency as compared to Spark. However, Spark was able to handle higher throughput as compared to the other streaming engines. Lu et al. [38] proposed the StreamBench benchmark framework to evaluate modern distributed stream processing frameworks. The framework includes dataset selection, data generation methodologies, program set description, workload suite design, and metric proposition. Two real-world datasets, AOL Search Data and the CAIDA Anonymized Internet Traces Dataset, were used to assess the performance, scalability, and fault tolerance aspects of streaming frameworks. It was observed that Storm's latency in most cases was far less than Spark's, except in the case when the scales of the workload and dataset were massive in nature.

3.6. Security. In contrast to the classical HPC environment, where information is stored in-house, many big data applications are now increasingly deployed on the cloud, where privacy-sensitive information may be accessed or recorded by different data users with ease. Although data privacy and security issues are not a new topic in the area of distributed computing, their importance is amplified by the wide adoption of cloud computing services for big data platforms. A dataset may be exposed to multiple users for different purposes, which may lead to security and privacy risks.

Let N be a list of security categories that may be provided by a security mechanism. For instance, a framework may use an encryption mechanism to provide data security and access control lists for authentication and authorization services. Let max(W_i) be the maximum weight that is assigned to the i-th security category from the list of N categories, and let W_i be the reputation score of a particular resource management framework. Then the framework security ranking score can be represented as

Sec_{score} = \frac{\sum_{i=1}^{N} W_i}{\sum_{i=1}^{N} \max(W_i)}.  (7)
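Equation (7) normalizes a framework's accumulated per-category reputation scores by the maximum attainable weights, yielding a score in [0, 1]. The categories and weights below are assumed for illustration only:

```python
# Equation (7): security score = sum(W_i) / sum(max(W_i)) over N categories.
max_weight = {"encryption": 1.0, "authentication": 1.0, "authorization": 1.0}
framework  = {"encryption": 0.8, "authentication": 1.0, "authorization": 0.6}

sec_score = sum(framework.values()) / sum(max_weight.values())   # in [0, 1]
```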

Hadoop and Storm use the Kerberos authentication protocol for computing nodes to prove their identity [50]. Spark adopts a password-based shared secret configuration as well as Access Control Lists (ACLs) to control the authentication and authorization mechanisms. In Flink, stream brokers are responsible for providing the authentication mechanism across multiple services. Apache Samza/Kafka provides no built-in security at the system level.

3.7. Dataset Size Support. Many scientific applications scale up to hundreds of nodes to process massive amounts of data that may exceed hundreds of terabytes. Unless big data applications are properly optimized for larger datasets, this may result in performance degradation with the growth of the data. Furthermore, in many cases this may result in a crash of the software, resulting in loss of time and money. We used the same methodology to collect big data support statistics as presented in the earlier sections.

4. Discussion on Big Data Frameworks

Every big data framework has evolved with its own unique design characteristics and application-specific requirements. Based on the key factors and the empirical evidence used to evaluate big data frameworks, as discussed in Section 3, we produce a summary of the results for the seven evaluation factors in the form of a star ranking, Ranking_RF \in [0, s], where s represents the maximum evaluation score. The general equation to produce the ranking for each resource framework (RF) is given as

Ranking_{RF} = \left\| \frac{s}{\sum_{i=1}^{7} \sum_{j=1}^{N} \max(W_{ij})} \times \sum_{i=1}^{7} \sum_{j=1}^{N} W_{ij} \right\|,  (8)

where N is the total number of research studies for a particular set of evaluation metrics, max(W_{ij}) \in [0, 1] is the relative normalized weight assigned to each literature study based on the number of experiments performed, and W_{ij} is the framework test-bed score calculated from the experimentation results of each study.
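Equation (8) can be worked through with assumed scores. The ||.|| operator is read here as rounding up to a whole number of stars, which is an interpretation, since the ratio itself is generally fractional while the rankings in Table 2 are whole stars; all W_ij values below are invented for illustration:

```python
import math

# Equation (8): star ranking out of s = 5, from 7 evaluation factors and N studies.
s = 5
W = [[0.7, 0.9], [0.5, 0.6], [0.8, 0.8], [0.6, 0.7],
     [0.9, 0.4], [0.5, 0.5], [0.7, 0.6]]       # W_ij: test-bed scores, 7 x N=2 (assumed)
W_max = [[1.0, 1.0] for _ in range(7)]         # max(W_ij): normalized weights in [0, 1]

raw = s * sum(map(sum, W)) / sum(map(sum, W_max))
ranking = math.ceil(raw)   # ||.|| read as rounding up to whole stars (assumption)
```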

Hadoop MapReduce has a clear edge in large-scale deployment and larger dataset processing. Hadoop is highly compatible and interoperable with other frameworks. It also offers a reliable fault tolerance mechanism that can provide a failure-free service over a long period of time. Hadoop can operate on a low-cost configuration. However, Hadoop is not suitable for real-time applications. It has a significant disadvantage when latency, throughput, and iterative job support for machine learning are the key considerations of the application requirements.

Apache Spark is designed to be a replacement for the batch-oriented Hadoop ecosystem, running over both static and real-time datasets. It is highly suitable for high-throughput streaming applications where latency is not a major issue. Spark is memory intensive, and all operations take place in memory. As a result, it may crash if enough memory is not available for further operations (before the release of Spark version 1.5, it was not capable of handling datasets larger than the size of RAM, and the problem of handling larger datasets still persists in the newer releases with different performance overheads). A few research efforts, such as Project Tungsten, are aimed at improving the memory and CPU efficiency of Spark applications. Spark also lacks its own storage system, so its integration with HDFS through YARN, or with Cassandra using Mesos, is an extra overhead for cluster configuration.

Apache Flink is a true streaming engine. Flink supports both batch and real-time operations over a common runtime to fulfill the requirements of the Lambda architecture. However,


it may also work in batch mode by stopping the streaming source. Like Spark, Flink performs all operations in memory, but in case of memory exhaustion it may also use disk storage to avoid application failure. Flink has some major advantages over Hadoop and Spark, providing better support for iterative processing with both high throughput and low latency.

Apache Storm was designed to provide a scalable, fault-tolerant, real-time streaming engine for data analysis, much as Hadoop did for batch processing. However, the empirical evidence suggests that Apache Storm proved to be inefficient in meeting the scale-up/scale-down requirements of real-time big data applications. Furthermore, since it uses microbatch stream processing, it is not very efficient where continuous stream processing is a major concern, nor does it provide a mechanism for simple batch processing. For fault tolerance, Storm uses Zookeeper to store the state of the processes, which may involve some extra overhead and may also result in message loss. On the other hand, Storm is an ideal solution for near-real-time application processing, where the workload can be processed with minimal delay under strict latency requirements.

Apache Samza, in integration with Kafka, provides some unique features that are not offered by other stream processing engines. Samza provides a powerful checkpointing-based fault tolerance mechanism with minimal data loss. Samza jobs can have high throughput with low latency when integrated with Kafka. However, Samza lacks some important features as a data processing engine. Furthermore, it offers no built-in security mechanism for data access control.
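The checkpointing-based recovery described above can be sketched in a few lines: a consumer reads from a durable, replayable log, periodically persists its offset, and on restart resumes from the last saved offset, reprocessing at most the uncheckpointed tail (at-least-once delivery). This is a pure-Python illustration under assumed names; it is not the Samza or Kafka API.

```python
# Minimal sketch of checkpoint-based fault tolerance in the style of
# Samza over a Kafka-like log. All names here are illustrative.
log = [f"msg-{i}" for i in range(10)]   # the durable, replayable input log
CHECKPOINT_EVERY = 3                    # persist the offset every 3 messages

def run(log, start_offset, crash_at=None):
    """Process messages from start_offset; return (processed, checkpoint).
    Optionally simulate a crash just before reaching offset crash_at."""
    processed, checkpoint = [], start_offset
    for offset in range(start_offset, len(log)):
        if offset == crash_at:
            return processed, checkpoint          # state at crash time
        processed.append(log[offset])
        if (offset + 1) % CHECKPOINT_EVERY == 0:
            checkpoint = offset + 1               # durably persist progress
    return processed, len(log)

# First run crashes at offset 7; the last durable checkpoint is offset 6.
first, ckpt = run(log, 0, crash_at=7)
# Recovery resumes from the checkpoint, so msg-6 is processed twice
# (at-least-once semantics), but nothing after the checkpoint is lost.
second, _ = run(log, ckpt)
```

The overlap between the two runs is exactly the tail processed after the last checkpoint, which is why checkpoint frequency trades recovery cost against runtime overhead.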

To categorize the selection of the best resource engine based on a particular set of requirements, we use the framework proposed by Chung et al. [51]. The framework provides a layout for matching, ranking, and selecting a system based on particular requirements. The matching criterion is based on user goals, which are categorized as soft and hard goals. Soft goals represent the nonfunctional requirements of the system (such as security, fault tolerance, and scalability), while hard goals represent the functional aspects of the system (such as machine learning and data size support). The relationship between the system components and the goals can be ranked as very positive (++), positive (+), negative (−), and very negative (−−). Based on the evidence provided in the literature, the categorization of major resource management frameworks is presented in Figure 7.
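The matching and ranking step described above can be sketched as a simple scoring procedure: map the symbolic ++/+/−/−− scale to numbers and sum the scores of the goals a user cares about. The frameworks, goal names, and scores below are illustrative placeholders, not the actual rankings of Figure 7.

```python
# Illustrative sketch of goal-based framework selection in the spirit of
# Chung et al. [51]. The score assignments are hypothetical examples.
SCALE = {"++": 2, "+": 1, "-": -1, "--": -2}

# Hypothetical per-framework rankings (goal -> symbolic score).
frameworks = {
    "Spark": {"fault_tolerance": "+",  "scalability": "++", "latency": "+"},
    "Flink": {"fault_tolerance": "++", "scalability": "+",  "latency": "++"},
    "Storm": {"fault_tolerance": "+",  "scalability": "-",  "latency": "++"},
}

def rank(frameworks, required_goals):
    """Sum the numeric scores of the goals the user requires,
    then order frameworks from best to worst match."""
    totals = {
        name: sum(SCALE[goals[g]] for g in required_goals if g in goals)
        for name, goals in frameworks.items()
    }
    return sorted(totals, key=totals.get, reverse=True)

# A user who needs strong fault tolerance and low latency:
best_first = rank(frameworks, ["fault_tolerance", "latency"])
```

Under these made-up scores, the ranking would put Flink first for that requirement set; in practice the scores come from the empirical evidence surveyed in the paper.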

5. Related Work

Our research work differs from other efforts because the subject, goal, and object of study are not identical: we provide an in-depth comparison of popular resource engines based on empirical evidence from the existing literature. Hesse and Lorenz [8] conducted a conceptual survey on stream processing systems. However, their discussion was focused on some basic differences related to real-time data processing engines. Singh and Reddy [9] provided a thorough analysis of big data analytic platforms that included peer-to-peer networks, field programmable gate arrays (FPGA), the Apache Hadoop ecosystem, high-performance computing (HPC) clusters, multicore CPUs, and graphics processing units (GPU). Our case is different here, as we are particularly interested in big data processing engines. Finally, Landset et al. [10] focused on machine learning libraries and their evaluation based on ease of use, scalability, and extensibility. However, our work differs from theirs, as the primary focuses of the two studies are not identical.

Chen and Zhang [11] discussed big data problems and challenges and the associated techniques and technologies to address these issues. Several potential techniques, including cloud computing, quantum computing, granular computing, and biological computing, were investigated, and possible opportunities to explore these domains were demonstrated. However, the performance evaluation was discussed only on theoretical grounds. A taxonomy and detailed analysis of the state of the art in big data 2.0 processing systems were presented in [12]. The focus of the study was to identify current research challenges and highlight opportunities for new innovations and optimization for future research and development. Assuncao et al. [13] reviewed multiple generations of data stream processing frameworks that provide mechanisms for resource elasticity to match the demands of stream processing services. The study examined the challenges associated with efficient resource management decisions and suggested solutions derived from the existing research studies. However, the study metrics are restricted to the elasticity/scalability aspect of big data streaming frameworks.

As shown in Table 3, our work differs from the previous studies, which focused on the classification of resource management frameworks on theoretical grounds. In contrast to the earlier approaches, we classify and categorize big data resource management frameworks on empirical grounds, derived from multiple evaluation/experimentation studies. Furthermore, our evaluation/ranking methodology is based on a comprehensive list of study variables which were not addressed in the studies conducted earlier.

6. Conclusions and Future Work

There are a number of disruptive and transformative big data technologies and solutions that are rapidly emerging and evolving in order to provide data-driven insight and innovation. The primary objective of the study was to classify popular big data resource management systems. This study was also aimed at addressing the selection of a candidate resource provider based on specific big data application requirements. We surveyed different big data resource management frameworks and investigated the advantages and disadvantages of each of them. We carried out a performance evaluation of the resource management engines based on seven key factors, and each of the frameworks was ranked based on the empirical evidence from the literature.

6.1. Observations and Findings. Some key findings of the study are as follows:

(i) In terms of processing speed, Apache Flink outperforms the other resource management frameworks for small, medium, and large datasets [30, 36]. However, during our own set of experiments on an Amazon EC2


Table 3: Comparison and application areas of related research studies.

Study reference | Data model | Resource frameworks | Study features | Evaluation/ranking methodology
[8] | Data stream processing systems | Storm, Flink, Spark, Samza | A brief comparison of resource frameworks | N
[9] | Batch and stream processing systems | Horizontal scaling systems such as peer-to-peer, MapReduce, MPI, and Spark, and vertical scaling systems such as CUDA and HDL | Comparison of horizontal and vertical scaling systems | Theoretical comparison of resource frameworks
[10] | Batch and stream processing engines | MapReduce, Spark, Flink, and Storm, as well as machine learning libraries | Machine learning libraries and their evaluation mechanism | Performance comparison with respect to machine learning toolkits
[11] | Batch and stream processing frameworks | Hadoop, Storm, and other big data frameworks | In-depth analysis of big data opportunities and challenges | N
[12] | Batch and stream processing frameworks | Hadoop, Spark, Storm, Flink, and Tez, as well as SQL, Graph, and the bulk synchronous parallel model | Analysis of current open research challenges in the field of big data and the promising directions for future research | N
[13] | Stream processing engines | Apache Storm, S4, Flink, Samza, Spark Streaming, and Twitter Heron | Classification of elasticity metrics for resource allocation strategies that meet the demands of stream processing services | Evaluation of elasticity/scaling metrics for stream processing systems


[Figure 7 shows a goal graph in which the frameworks Hadoop, Spark, Flink, Samza, and Storm are scored (++/+) against soft (nonfunctional) goals, including fault tolerance (recoverability), scalability, security (access control, authentication), and latency (low), and hard (functional) goals, including machine learning (iterative), data size (large, medium), and performance (high), using AND decomposition.]

Figure 7: Categorization of resource engines based on key big data application requirements.

cluster with varied task manager settings (1-4 task managers per node), Flink failed to complete custom smaller-size JVM dataset jobs due to inefficient memory management by the Flink memory manager. We could not find any reported evidence of this particular use case in the relevant literature. It seems that most of the performance evaluation studies employed standard benchmarking test sets where the dataset size was relatively large, and hence this particular use case was not reported in those studies. Further research effort is required to elucidate the specific underlying factors in this particular case.

(ii) Big data applications usually involve massive amounts of data. Apache Spark supports the necessary strategies for its fault tolerance mechanism, but it has been reported to crash on larger datasets. Even during our experimentations, Apache Spark version 1.6 (selected for compatibility with earlier research) crashed on several occasions when the dataset was larger than 500 GB. Although Spark has been ranked higher in terms of fault tolerance with the increase of data scale in several studies [38, 52], it has limitations in handling larger, typical big data applications, and hence such studies cannot be generalized.

(iii) Spark MLlib and Flink-ML offer a variety of machine learning algorithms and utilities to exploit distributed and scalable big data applications. Spark MLlib outperforms Flink-ML in most of the machine learning use cases [42], except in the case when repeated passes are performed on unchanged data. However, this performance evaluation may be investigated further, as research studies such as [53] reported differently, with Flink outperforming Spark on sufficiently large cases.

(iv) For graph processing algorithms such as PageRank, Flink uses the Gelly library, which provides native closed-loop iteration operators, making it a suitable platform for large-scale graph analytics. Spark, on the other hand, uses the GraphX library, which has a much longer preprocessing time to build graphs and other data structures, making its performance worse as compared to Apache Flink. Apache Flink has been reported to obtain the best results for graph processing datasets (3x-5x in [54] and 2x-3x in [45]) as compared to Spark. However, some studies, such as [55], reported Spark to be 1.7x faster than Flink for large graph processing. Such inconsistent behavior may be investigated further in future research studies.

6.2. Future Work. In the area of big data, there is still a clear gap that requires more effort from the research community to build an in-depth understanding of the performance characteristics of big data resource management frameworks. We consider this study a step towards enlarging our knowledge of the big data world and an effort towards improving the state of the art and achieving the big vision of the big data domain. In the earlier studies, a clear ranking could not be established, as the study parameters were mostly limited to a few issues such as throughput, latency, and machine learning. Furthermore, additional investigation is required on resource engines such as Apache Samza in comparison with other frameworks. In addition, research effort needs to be carried out in several areas, such as data organization, platform-specific tools, and technological issues in the big data domain, in order to create next-generation big data infrastructures. The performance evaluation factors might also vary among systems depending on the algorithms used. As future work, we plan to benchmark these popular resource engines for meeting the resource demands and requirements of different scientific applications. Moreover, a scalability analysis could be done; in particular, the performance evaluation of adding dynamic worker nodes and the resulting performance analysis is of special interest. Additionally, further research can be carried out to evaluate performance aspects with respect to resource competition between jobs (on different resource schedulers such as YARN and Mesos) and the fluctuation of available computing resources. Finally, most of the experimentations in earlier studies were performed using standard parameter configurations; however, each resource management framework offers domain-specific tweaks and configuration optimization mechanisms for meeting application-specific requirements. The development of a benchmark suite that aims to find the maximum throughput based on configuration optimization would be an interesting direction for future research.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] "Big Data in the Cloud: Converging Technologies," https://www.intel.com/content/dam/www/public/emea/de/de/documents/product-briefs/big-data-cloud-technologies-brief.pdf.

[2] Q. He, J. Yan, R. Kowalczyk, H. Jin, and Y. Yang, "Lifetime service level agreement management with autonomous agents for services provision," Information Sciences, vol. 179, no. 15, pp. 2591-2605, 2009.

[3] O. Rana, M. Warnier, T. B. Quillinan, and F. Brazier, "Monitoring and reputation mechanisms for service level agreements," in 5th International Workshop on Grid Economics and Business Models, GECON '08, vol. 5206 of Lecture Notes in Computer Science, pp. 125-139, 2008.

[4] N. Yuhanna and M. Gualtieri, "The Forrester Wave: Big Data Hadoop Cloud Solutions, Q2 2016: Elasticity, Automation, and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud," https://ncmedia.azureedge.net/ncmedia/2016/05/The_Forrester_Wave_Big_D.pdf.

[5] R. Arora, "An introduction to big data, high performance computing, high-throughput computing, and Hadoop," in Conquering Big Data with High Performance Computing, pp. 1-12, 2016.

[6] F. Pop, J. Kołodziej, and B. D. Martino, Resource Management for Big Data Platforms: Algorithms, Modelling, and High-Performance Computing Techniques, Springer, 2016.

[7] M. Gribaudo, M. Iacono, and F. Palmieri, "Performance modeling of big data-oriented architectures," in Resource Management for Big Data Platforms, Computer Communications and Networks, pp. 3-34, Springer International Publishing, Cham, 2016.

[8] G. Hesse and M. Lorenz, "Conceptual survey on data stream processing systems," in Proceedings of the 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), pp. 797-802, Melbourne, Australia, December 2015.

[9] D. Singh and C. K. Reddy, "A survey on platforms for big data analytics," Journal of Big Data, vol. 2, Article ID 8, 2014.

[10] S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin, "A survey of open source tools for machine learning with big data in the Hadoop ecosystem," Journal of Big Data, vol. 2, no. 1, Article ID 24, 2015.

[11] C. L. P. Chen and C. Y. Zhang, "Data-intensive applications, challenges, techniques and technologies: a survey on Big Data," Information Sciences, vol. 275, pp. 314-347, 2014.

[12] F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, and S. Sakr, "Big data 2.0 processing systems: taxonomy and open challenges," Journal of Grid Computing, vol. 14, no. 3, pp. 379-405, 2016.

[13] M. D. de Assuncao, A. da S. Veith, and R. Buyya, "Distributed data stream processing and edge computing: a survey on resource elasticity and future directions," Journal of Network and Computer Applications, vol. 103, pp. 1-17, 2018.

[14] "Big Data Working Group, Big Data Taxonomy, 2014."

[15] W. Inoubli, S. Aridhi, H. Mezni, M. Maddouri, and E. M. Nguifo, "An experimental survey on big data frameworks," Future Generation Computer Systems, 2018.

[16] H.-K. Kwon and K.-K. Seo, "A fuzzy AHP based multi-criteria decision-making model to select a cloud service," International Journal of Smart Home, vol. 8, no. 3, pp. 175-180, 2014.


[17] A. Rezgui and S. Rezgui, "A stochastic approach for virtual machine placement in volunteer cloud federations," in Proceedings of the 2nd IEEE International Conference on Cloud Engineering, IC2E 2014, pp. 277-282, Boston, MA, USA, March 2014.

[18] M. Salama, A. Zeid, A. Shawish, and X. Jiang, "A novel QoS-based framework for cloud computing service provider selection," International Journal of Cloud Applications and Computing, vol. 4, no. 2, pp. 48-72, 2014.

[19] G. Mateescu, W. Gentzsch, and C. J. Ribbens, "Hybrid computing - where HPC meets grid and cloud computing," Future Generation Computer Systems, vol. 27, no. 5, pp. 440-453, 2011.

[20] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: enabling scalable virtual organizations," International Journal of High Performance Computing Applications, vol. 15, no. 3, pp. 200-222, 2001.

[21] S. Benedict, "Performance issues and performance analysis tools for HPC cloud applications: a survey," Computing, vol. 95, no. 2, pp. 89-108, 2013.

[22] E. C. Inacio and M. A. R. Dantas, "A survey into performance and energy efficiency in HPC, cloud and big data environments," International Journal of Networking and Virtual Organisations, vol. 14, no. 4, pp. 299-318, 2014.

[23] N. Yuhanna and M. Gualtieri, "Elasticity, Automation and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud, Q2 2016."

[24] S. Ullah, M. D. Awan, and M. S. Khiyal, "A price-performance analysis of EC2, Google Compute and Rackspace cloud providers for scientific computing," Journal of Mathematics and Computer Science, vol. 16, no. 2, pp. 178-192, 2016.

[25] J. Hurwitz, A. Nugent, F. Halper, and M. Kaufman, Big Data for Dummies, John Wiley & Sons, 2013.

[26] S. K. Garg, S. Versteeg, and R. Buyya, "A framework for ranking of cloud computing services," Future Generation Computer Systems, vol. 29, no. 4, pp. 1012-1023, 2013.

[27] D. Wu, S. Sakr, and L. Zhu, "Big data programming models," in Handbook of Big Data Technologies, pp. 31-63, 2017.

[28] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, vol. 1, no. 1, 2016.

[29] A. Li, X. Yang, S. Kandula, and M. Zhang, "CloudCmp: comparing public cloud providers," in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10), pp. 1-14, ACM, Melbourne, Australia, November 2010.

[30] J. Veiga, R. R. Exposito, X. C. Pardo, G. L. Taboada, and J. Touriño, "Performance evaluation of big data frameworks for large-scale data analytics," in Proceedings of the 4th IEEE International Conference on Big Data, Big Data 2016, pp. 424-431, December 2016.

[31] I. Mavridis and H. Karatza, "Log file analysis in cloud with Apache Hadoop and Apache Spark," in Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015), Poland, 2015.

[32] M. Zaharia, M. Chowdhury, and T. Das, "Fast and interactive analytics over Hadoop data with Spark," USENIX Login, vol. 37, no. 4, pp. 45-51, 2012.

[33] S. Vellaipandiyan and P. V. Raja, "Performance evaluation of distributed framework over YARN cluster manager," in Proceedings of the 2016 IEEE International Conference on Computational Intelligence and Computing Research, ICCIC 2016, December 2016.

[34] V. Taran, O. Alienin, S. Stirenko, Y. Gordienko, and A. Rojbi, "Performance evaluation of distributed computing environments with Hadoop and Spark frameworks," in Proceedings of the 2017 IEEE International Young Scientists' Forum on Applied Physics and Engineering (YSF), pp. 80-83, Lviv, Ukraine, October 2017.

[35] S. Gopalani and R. Arora, "Comparing Apache Spark and Map Reduce with performance analysis using K-means," International Journal of Computer Applications, vol. 113, no. 1, pp. 8-11, 2015.

[36] M. Bertoni, S. Ceri, A. Kaitoua, and P. Pinoli, "Evaluating cloud frameworks on genomic applications," in Proceedings of the 3rd IEEE International Conference on Big Data, IEEE Big Data 2015, pp. 193-202, November 2015.

[37] S. Hukerikar, R. A. Ashraf, and C. Engelmann, "Towards new metrics for high-performance computing resilience," in Proceedings of the 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017, pp. 23-30, June 2017.

[38] R. Lu, G. Wu, B. Xie, and J. Hu, "Stream bench: towards benchmarking modern distributed stream computing frameworks," in Proceedings of the 7th IEEE/ACM International Conference on Utility and Cloud Computing, UCC 2014, pp. 69-78, December 2014.

[39] L. Gu and H. Li, "Memory or time: performance evaluation for iterative operation on Hadoop and Spark," in Proceedings of the 15th IEEE International Conference on High Performance Computing and Communications, HPCC 2013, and 11th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, EUC 2013, pp. 721-727, November 2013.

[40] M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte, "A performance comparison of open-source stream processing platforms," in Proceedings of the 59th IEEE Global Communications Conference, GLOBECOM 2016, December 2016.

[41] J. L. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, vol. 31, no. 5, pp. 532-533, 1988.

[42] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, "A comparison on scalability for batch big data processing on Apache Spark and Apache Flink," Big Data Analytics, vol. 2, no. 1, 2017.

[43] P. Jakovits and S. N. Srirama, "Evaluating MapReduce frameworks for iterative scientific computing applications," in Proceedings of the 2014 International Conference on High Performance Computing and Simulation, HPCS 2014, pp. 226-233, July 2014.

[44] C. Boden, A. Spina, T. Rabl, and V. Markl, "Benchmarking data flow systems for scalable machine learning," in Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2017, May 2017.

[45] N. Spangenberg, M. Roth, and B. Franczyk, Evaluating New Approaches of Big Data Analytics Frameworks, vol. 208 of Lecture Notes in Business Information Processing, Springer, Cham, 2015.

[46] J. Shi, Y. Qiu, U. F. Minhas et al., "Clash of the titans: MapReduce vs. Spark for large scale data analytics," in Proceedings of the 41st International Conference on Very Large Data Bases, pp. 2110-2121, Kohala Coast, HI, USA, 2015.

[47] M. Kang and J. Lee, "A comparative analysis of iterative MapReduce systems," in Proceedings of the Sixth International Conference, pp. 61-64, Jeju, Republic of Korea, October 2016.

[48] H. Lee, M. Kang, S.-B. Youn, J.-G. Lee, and Y. Kwon, "An experimental comparison of iterative MapReduce frameworks," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, pp. 2089-2094, October 2016.

[49] S. Chintapalli, D. Dagit, B. Evans et al., "Benchmarking streaming computation engines: Storm, Flink and Spark streaming," in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016, pp. 1789-1792, May 2016.

[50] X. Zhang, C. Liu, S. Nepal, W. Dou, and J. Chen, "Privacy-preserving layer over MapReduce on cloud," in Proceedings of the 2nd International Conference on Cloud and Green Computing, CGC 2012, held jointly with the 2nd International Conference on Social Computing and Its Applications, SCA 2012, pp. 304-310, China, November 2012.

[51] L. Chung, B. A. Nixon, and E. Yu, "Using nonfunctional requirements to systematically select among alternatives in architectural design," in Proceedings of the 1st International Workshop on Architectures for Software Systems, pp. 31-43, 1995.

[52] S. Qian, G. Wu, J. Huang, and T. Das, "Benchmarking modern distributed streaming platforms," in Proceedings of the 2016 IEEE International Conference on Industrial Technology (ICIT), pp. 592-598, Taipei, Taiwan, March 2016.

[53] F. Dambreville and A. Toumi, "Generic and massively concurrent computation of belief combination rules," in Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, pp. 1-6, Blagoevgrad, Bulgaria, November 2016.

[54] R. W. Techentin, M. W. Markland, R. J. Poole, D. R. Holmes, C. R. Haider, and B. K. Gilbert, "Page rank performance evaluation of cluster computing frameworks on Cray Urika-GX supercomputer," Computer Science and Engineering, vol. 6, pp. 33-38, 2016.

[55] O.-C. Marcu, A. Costan, G. Antoniu, and M. S. Perez-Hernandez, "Spark versus Flink: understanding performance in big data analytics frameworks," in Proceedings of the 2016 IEEE International Conference on Cluster Computing, CLUSTER 2016, pp. 433-442, Taipei, Taiwan, September 2016.


widely used in conjunction with the cloud computing environment.

2.1. Hadoop. Hadoop [26] is a distributed programming and storage infrastructure based on an open-source implementation of the MapReduce model [27]. MapReduce is the first and current de facto programming environment for developing data-centric parallel applications for parsing and processing large datasets. MapReduce is inspired by the Map and Reduce primitives used in functional programming. In MapReduce programming, users only have to write the logic of the Mapper and Reducer, while the process of shuffling, partitioning, and sorting is automatically handled by the execution engine [14, 27, 28]. The data can either be saved in the Hadoop file system as unstructured data or in a database as structured data [14]. The Hadoop Distributed File System (HDFS) is responsible for breaking large data files into smaller pieces known as blocks. The blocks are placed on different data nodes, and it is the job of the NameNode to track which blocks on which data nodes make up the complete file. The NameNode also works as a traffic cop, handling all access to the files, including reads, writes, creates, deletes, and replication of data blocks on the data nodes. A pipeline is a link between multiple data nodes that exists to handle the transfer of data across the servers. A user application pushes a block to the first data node in the pipeline. The data node takes over and forwards the block to the next node in the pipeline; this continues until all the data, and all the data replicas, are saved to disk. Afterwards, the client repeats the process by writing the next block in the file [25].
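The division of labor described above, where the user supplies only the Mapper and Reducer logic while the framework handles shuffling, partitioning, and sorting, can be illustrated with a single-process word-count sketch. This is a pure-Python mimic of the programming model, not the Hadoop API.

```python
# Self-contained sketch of the MapReduce programming model: user code is
# map_fn/reduce_fn; the "framework" part does the shuffle/sort and grouping.
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Mapper: emit a (word, 1) pair for every word in an input line.
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    # Reducer: sum all counts emitted for one word.
    return (word, sum(counts))

def map_reduce(lines, map_fn, reduce_fn):
    # Map phase: run the mapper over every input record.
    intermediate = [pair for line in lines for pair in map_fn(line)]
    # Shuffle/sort phase: bring equal keys together (handled by the engine).
    intermediate.sort(key=itemgetter(0))
    # Reduce phase: one reducer call per distinct key.
    return [reduce_fn(key, [c for _, c in group])
            for key, group in groupby(intermediate, key=itemgetter(0))]

counts = dict(map_reduce(["big data big", "cloud data"], map_fn, reduce_fn))
# counts == {"big": 2, "cloud": 1, "data": 2}
```

In real Hadoop the intermediate pairs would also be partitioned across reducers and spilled to disk between the phases; the control flow, however, is the same.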

The two major components of Hadoop MapReduce are job scheduling and tracking. The early versions of Hadoop supported a limited job and task tracking system. In particular, the earlier scheduler could not manage non-MapReduce tasks, and it was not capable of optimizing cluster utilization. A new capability was therefore needed to address these shortcomings and offer more flexibility, scaling, efficiency, and performance. Because of these issues, Hadoop 2.0 was introduced. Alongside the earlier HDFS storage and MapReduce model, it introduced a new resource management layer called Yet Another Resource Negotiator (YARN), which takes care of better resource utilization [25].

YARN is the core Hadoop service providing two major functionalities: global resource management (ResourceManager) and per-application management (ApplicationMaster). The ResourceManager is a master service which controls the NodeManager in each of the nodes of a Hadoop cluster. It includes a scheduler whose main task is to allocate system resources to specific running applications. All the required system information is tracked by a Resource Container, which monitors CPU, storage, network, and other important resource attributes necessary for executing applications in the cluster. The ResourceManager has a slave NodeManager service to monitor application usage statistics. Each deployed application is handled by a corresponding ApplicationMaster service. If more resources are required to support a running application, the ApplicationMaster requests the NodeManager, and the NodeManager negotiates with the ResourceManager (scheduler) for the additional capacity on behalf of the application [26].

2.2. Spark. Apache Spark [29], originally developed as Berkeley Spark, was proposed as an alternative to Hadoop. It can perform faster parallel computing operations by using in-memory primitives. A job can load data into either local memory or a cluster-wide shared memory and query it iteratively with much greater speed as compared to disk-based systems such as Hadoop MapReduce [27]. Spark has been developed for two applications where keeping data in memory may significantly improve performance: iterative machine learning algorithms and interactive data mining. Spark is also intended to unify the current processing stack, where batch processing is performed using MapReduce, interactive queries are performed using HBase, and the processing of streams for real-time analytics is performed using other frameworks such as Twitter's Storm. Spark offers programmers a functional programming paradigm with data-centric programming interfaces built on top of a new data model called the Resilient Distributed Dataset (RDD), which is a collection of objects spread across a cluster and stored in memory or on disk [28]. Applications in Spark can load these RDDs into the memory of a cluster of nodes and let the Spark engine automatically manage the partitioning of the data and its locality during runtime. This versatile iterative model makes it possible to control the persistence and manage the partitioning of data. A stream of incoming data can be partitioned into a series of batches and processed as a sequence of small-batch jobs. The Spark framework allows this seamless combination of streaming and batch processing in a unified system. To provide rapid application development, Spark provides clean, concise APIs in Scala, Java, and Python. Spark can also be used interactively from the Scala and Python shells to rapidly query big datasets.
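The RDD idea described above, that is, transformations recorded lazily and only materialized by an action, with the result optionally cached in memory for iterative reuse, can be mimicked in plain Python. The class and method names below are hypothetical illustrations of the concept, not the Spark API.

```python
# Pure-Python sketch of lazy, cacheable datasets in the spirit of RDDs.
class LazyDataset:
    def __init__(self, data):
        self._data = list(data)
        self._ops = []           # deferred transformations, not yet run
        self._keep = False       # cache the materialized result?
        self._cache = None

    def map(self, fn):
        # Record the transformation; nothing is computed yet.
        self._ops.append(lambda rows, f=fn: [f(r) for r in rows])
        return self

    def filter(self, pred):
        self._ops.append(lambda rows, p=pred: [r for r in rows if p(r)])
        return self

    def cache(self):
        # Ask for the result to be kept in memory after the first action.
        self._keep = True
        return self

    def collect(self):
        # Action: triggers execution of the whole recorded pipeline.
        if self._cache is None:
            rows = self._data
            for op in self._ops:
                rows = op(rows)
            if not self._keep:
                return rows
            self._cache = rows   # in-memory reuse for iterative workloads
        return self._cache

squares = LazyDataset(range(5)).map(lambda x: x * x).filter(lambda x: x > 4).cache()
first = squares.collect()    # pipeline runs once here
second = squares.collect()   # served from the in-memory cache
```

Real RDDs add partitioning, lineage-based recovery, and distribution across a cluster, but the lazy-pipeline-plus-cache control flow is the core of why iterative algorithms run so much faster than on disk-based MapReduce.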

Spark is also the engine behind Shark, a complete Apache Hive-compatible data warehousing system that can run much faster than Hive. Spark also supports data access from Hadoop. Spark fits in seamlessly with the Hadoop 2.0 ecosystem (Figure 2) as an alternative to MapReduce, while using the same underlying infrastructure, such as YARN and HDFS. Spark is also an integral part of the SMACK stack to provide the most popular cloud-native PaaS offerings, such as IoT, predictive analytics, and real-time personalization for big data. In SMACK, the Apache Mesos cluster manager (instead of YARN) is used for dynamic allocation of cluster resources, not only for running Hadoop applications but also for handling heterogeneous workloads.

The GraphX and MLlib libraries include state-of-the-art graph and machine learning algorithms that can be executed in real time. BlinkDB is a novel parallel, sampling-based approximate query engine for running interactive SQL queries; it trades off query accuracy for response time, with results annotated by meaningful error bars. BlinkDB has been shown to run 200 times faster than Hive within an error rate of 2-10%. Moreover, Spark provides an interactive tool called Spark Shell, which allows exploiting the Spark cluster in real time. Once interactive applications are created, they may subsequently be executed interactively in the cluster.
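The accuracy-for-latency trade-off behind BlinkDB can be demonstrated with a toy aggregate: estimate a mean from a small random sample and report a confidence half-width as the "error bar." The dataset, sample rate, and function names below are illustrative; this is a sketch of the statistical idea, not BlinkDB's implementation.

```python
# Sampling-based approximate aggregation: scan 1% of the rows instead of
# all of them, and annotate the answer with a 95% confidence error bar.
import math
import random

random.seed(42)
population = [float(i % 100) for i in range(100_000)]   # stand-in "table"

def approx_mean(rows, sample_rate):
    sample = random.sample(rows, int(len(rows) * sample_rate))
    mean = sum(sample) / len(sample)
    # 95% confidence half-width from the sample standard deviation.
    var = sum((x - mean) ** 2 for x in sample) / (len(sample) - 1)
    return mean, 1.96 * math.sqrt(var / len(sample))

estimate, err = approx_mean(population, 0.01)   # touches only 1,000 rows
exact = sum(population) / len(population)       # full scan, for comparison
```

The approximate answer is computed from two orders of magnitude less data than the exact one, which is exactly the response-time win an approximate query engine exploits on interactive workloads.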


[Figure 2 depicts the Hadoop 2.0 ecosystem stack: HDFS (distributed file system) and YARN (cluster resource management) running over the hardware, hypervisor, OS, and JVM layers, with MapReduce (distributed data processing), the TEZ execution engine, HBase (column-oriented NoSQL database), Pig (data flow language), Hive (SQL queries), Lucene (search), Oozie (workflow scheduling), and Zookeeper (coordination) on top.]

Figure 2: Hadoop 2.0 ecosystem. Source: [14].

[Figure 3 depicts the Spark architecture: a driver program hosting the SparkContext connects through a cluster manager (Spark Standalone, YARN, or Mesos) to worker nodes, with the MLlib, SparkSQL, Spark Streaming, and GraphX libraries on top and data sources such as HDFS and a DBMS below.]

Figure 3: Spark architecture. Source: [15].

In Figure 3, we present the general Spark system architecture.

2.3. Flink. Apache Flink is an emerging competitor of Spark which offers functional programming interfaces quite similar to Spark's. It shares many programming primitives and transformations with Spark for iterative development, predictive analysis, and graph stream processing. Flink was developed to fill the gap left by Spark, which uses minibatch stream processing instead of a pure streaming approach. Flink ensures high processing performance when dealing with complex big data structures such as graphs. Flink programs are regular applications written with a rich set of transformation operations (such as mapping, filtering, grouping, aggregating, and joining) applied to the input datasets. The Flink dataset uses a table-based model; therefore, application developers can use index numbers to specify a particular field of a dataset [27, 28].

Flink is able to achieve high throughput and low latency, thereby processing a bundle of data very quickly. Flink is designed to run on large-scale clusters with thousands of nodes, and in addition to a standalone cluster mode, Flink provides support for YARN. For distributed execution, Flink chains operator subtasks together into tasks; each task is executed by one thread [16]. The Flink runtime consists of two types of processes. There is at least one JobManager (also called the master), which coordinates the distributed execution: it schedules tasks, coordinates checkpoints, and coordinates recovery on failures. A high-availability setup may involve multiple JobManagers, one of which is always the leader while the others are on standby. The TaskManagers (also called workers) execute the tasks (or, more specifically, the subtasks) of a dataflow, and buffer and exchange the data streams. There must always be at least one TaskManager. The JobManagers and TaskManagers can be started in various ways: directly on the machines as a standalone cluster, in containers, or managed by resource frameworks like YARN or Mesos. TaskManagers connect to JobManagers, announce themselves as available, and are assigned work. Figure 4 shows the main components of the Flink framework.

2.4. Storm. Storm [17] is a free, open-source distributed stream processing computation framework. It takes several characteristics from the popular actor model and can be used with practically any programming language for developing applications such as real-time streaming analytics, critical work flow systems, and data delivery services. The engine may process billions of tuples each day in a fault-tolerant way. It can be integrated with popular resource management frameworks such as YARN, Mesos, and Docker. An Apache Storm cluster is made up of two types of processing actors: spouts and bolts.

(i) A spout is connected to the external data source of a stream and continuously emits or collects new data for further processing.

(ii) A bolt is a processing logic unit within a stream-processing topology; each bolt is responsible for a certain processing task, such as transformation, filtering, aggregating, and partitioning.

Figure 4: Architectural components of Flink, source [16]. A client translates the Flink program through an optimizer/graph builder into a dataflow graph and submits the job to the JobManager (master, or YARN Application Master), which schedules tasks, triggers checkpoints via a checkpoint coordinator, and tracks task status through heartbeats and statistics; the TaskManagers (workers) provide task slots that execute the tasks and exchange data streams through their memory/IO and network managers, with actor systems handling the control messages.

Storm defines a workflow as a directed acyclic graph (DAG), called a topology, with connected spouts and bolts as vertices. Edges in the graph define the links between the bolts and the data stream. Unlike batch jobs, which are executed only once, Storm jobs run forever until they are killed. There are two types of nodes in a Storm cluster: nimbus (master node) and supervisor (worker node). Nimbus, similar to the Hadoop JobTracker, is the core component of Apache Storm and is responsible for distributing load across the cluster, queuing and assigning tasks to different processing units, and monitoring execution status. Each worker node executes a process known as the supervisor, which may have one or more worker processes. The supervisor delegates tasks to the worker processes; a worker process then creates a subset of the topology to run the task. Apache Storm relies on an internal distributed messaging system called Netty for the communication between nimbus and the supervisors. Zookeeper manages the communication between the real-time job tracker (nimbus) and the supervisors (Storm workers). Figure 5 outlines a high-level view of a Storm cluster.
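As a deliberately simplified illustration of the spout/bolt model described above, the following pure-Python sketch wires a spout emitting sentences into a splitting bolt and a counting bolt. It mimics only the dataflow idea, not Storm's actual API, distribution, or fault tolerance; all class names here are our own.

```python
from collections import Counter

class SentenceSpout:
    """Toy spout: emits tuples from a fixed in-memory source
    (a real spout would pull from Kafka, a queue, etc.)."""
    def __init__(self, sentences):
        self.sentences = sentences

    def emit(self):
        yield from self.sentences

class SplitBolt:
    """Toy bolt: splits each incoming sentence tuple into word tuples."""
    def process(self, sentence):
        yield from sentence.split()

class CountBolt:
    """Toy terminal bolt: maintains a running word count."""
    def __init__(self):
        self.counts = Counter()

    def process(self, word):
        self.counts[word] += 1

# Wire the topology: spout -> split bolt -> count bolt
spout = SentenceSpout(["big data", "big compute"])
split_bolt, count_bolt = SplitBolt(), CountBolt()
for sentence in spout.emit():
    for word in split_bolt.process(sentence):
        count_bolt.process(word)

print(count_bolt.counts["big"])  # 2
```

In Storm proper, each vertex of this graph would run as many parallel task instances on worker processes, and tuples would flow between them over the network forever until the topology is killed.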

2.5. Apache Samza. Apache Samza [18] is a distributed stream processing framework mainly written in Scala and Java. Overall, it has a relatively high throughput as well as somewhat increased latency when compared to Storm [8]. It uses Apache Kafka, originally developed at LinkedIn, for messaging and streaming, while Apache Hadoop YARN or Mesos is utilized as an execution platform for overall resource management. Samza relies on Kafka's semantics to define the way streams are handled. Its main objective is to collect and deliver massively large volumes of event data, in particular log data, with low latency. A Kafka system's architecture is comparatively simple, as it consists only of a set of brokers, the individual nodes that make up a Kafka cluster. Data streams are defined by topics; a topic is a stream of related information that consumers can subscribe to. Topics are divided into partitions that are distributed over the broker instances, from which the corresponding messages are retrieved using a pull mechanism. The basic flow of job execution is presented in Figure 6.

Tables 1 and 2 present a brief comparative analysis of these frameworks based on some common attributes. As shown in the tables, the MapReduce computation data flow follows a chain of stages with no loops. At each stage, the program


Table 1: Comparison of big data frameworks.

| Attribute | Hadoop | Spark | Storm | Samza | Flink |
|---|---|---|---|---|---|
| Current stable version | 2.8.1 | 2.2.0 | 1.1.1 | 0.13.0 | 1.3.2 |
| Batch processing | Yes | Yes | Yes | No | Yes |
| Computational model | MapReduce | Streaming (microbatches) | Streaming (microbatches) | Streaming | Supports continuous flow streaming, microbatch, and batch |
| Data flow | Chain of stages | Directed acyclic graph | Directed acyclic graphs (DAGs) with spouts and bolts | Streams (acyclic graph) | Controlled cyclic dependency graph for machine learning |
| Resource management | YARN | YARN, Mesos | HDFS (YARN), Mesos | YARN, Mesos | Zookeeper, YARN, Mesos |
| Language support | All major languages | Java, Scala, Python, and R | Any programming language | JVM languages | Java, Scala, Python, and R |
| Job management and optimization | MapReduce approach | Catalyst extension | Storm-YARN, 3rd-party tools like Ganglia | Internal JobRunner | Internal optimizer |
| Interactive mode | None (3rd-party tools like Impala can be integrated) | Interactive shell | None | Limited API of Kafka streams | Scala shell |
| Machine learning libraries | Apache Mahout, H2O | SparkML and MLlib | Trident-ML, Apache SAMOA | Apache SAMOA | Flink-ML |
| Maximum reported nodes (scalability) | Yahoo Hadoop cluster with 42,000 nodes | 8,000 | 300 | LinkedIn with around a hundred node clusters | Alibaba customized Flink cluster with thousands of nodes |


Table 2: Comparative analysis of big data resource frameworks (s = 5).

| Criterion | Hadoop | Spark | Flink | Storm | Samza |
|---|---|---|---|---|---|
| Processing speed | ★★★ | ★★★★ | ★★★★★ | ★★★★ | ★★★★ |
| Fault tolerance | ★★★★ | ★★ | ★★★★ | ★★★ | ★★★★ |
| Scalability | ★★★★★ | ★★★★ | ★★★ | ★★★ | ★★★ |
| Machine learning | ★★ | ★★★★★ | ★★★★ | ★★★ | ★★★★ |
| Low latency | ★★ | ★★★ | ★★★ | ★★★★ | ★★★★ |
| Security | ★★★★ | ★★★★★ | ★★★★ | ★★★★ | N |
| Dataset size | ★★★★★ | ★★★ | ★★★★ | ★★★ | ★★★ |

Figure 5: Architecture of a Storm cluster, source [17]. A nimbus master node coordinates, through the Zookeeper framework, a set of worker nodes; on each worker node a supervisor manages one or more worker processes, whose executors run the individual tasks.

Figure 6: Samza architecture, source [18]. A Samza job consumes Kafka topics; YARN NodeManagers on each machine host Samza containers, whose tasks process the partitioned streams and write results back to Kafka topics.

proceeds with the output from the previous stage and generates the input for the next stage. Although machine learning algorithms are mostly designed in the form of cyclic data flows, Spark, Storm, and Samza represent them as directed acyclic graphs to optimize the execution plan. Flink supports a controlled cyclic dependency graph at runtime to represent machine learning algorithms in a very efficient way.

Hadoop and Storm do not provide any default interactive environment. Apache Spark has a command-line interactive shell to use the application features, while Flink provides a Scala shell for configuring both standalone and cluster setups. Apache Hadoop is highly scalable, and it has been used in Yahoo production environments consisting of 42,000 nodes across 20 YARN clusters. The largest known cluster size for Spark is 8,000 computing nodes, while Storm has been tested on clusters of at most 300 nodes. An Apache Samza cluster with around a hundred nodes has been used in LinkedIn's data flow and application messaging system. Apache Flink has been customized for the Alibaba search engine with a deployment capacity of thousands of processing nodes.

3. Comparative Evaluation of Big Data Frameworks

Big data in cloud computing, a popular research trend, is exerting significant influence on current enterprises, IT industries, and research communities. There are a number of disruptive and transformative big data technologies and solutions that are rapidly emanating and evolving in order to provide data-driven insight and innovation. Furthermore, modern cloud computing services offer all kinds of big data analytic tools, technologies, and computing infrastructures to speed up the data analysis process at an affordable cost. Although many distributed resource management frameworks are available nowadays, the main issue is how to


select a suitable big data framework. The selection of one big data platform over the others will come down to the specific application requirements and constraints, which may involve several tradeoffs and application usage scenarios. However, we can identify some key factors that need to be fulfilled before deploying a big data application in the cloud. In this section, based on some empirical evidence from the available literature, we discuss the advantages and disadvantages of each resource management framework.

3.1. Processing Speed. Processing speed is an important performance measurement that may be used to evaluate the effectiveness of different resource management frameworks. It is a common metric for the maximum number of I/O operations to disk or memory, or the data transfer rate between the computational units of the cluster, over a specific amount of time. In the context of big data, the average processing speed, represented as $m$ and calculated after $n$ iteration runs, is the maximum number of memory/disk-intensive operations $m_i$ that can be performed over the time intervals $t_i$:

$$m = \frac{\sum_{i=1}^{n} m_i}{\sum_{i=1}^{n} t_i}. \quad (1)$$
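Equation (1) is straightforward to compute; the following minimal sketch does so for hypothetical per-run operation counts and durations (the numbers are illustrative, not measurements).

```python
def processing_speed(ops, times):
    """Average processing speed m (Eq. (1)): total memory/disk
    operations across n runs divided by the total elapsed time."""
    return sum(ops) / sum(times)

# Hypothetical: three runs with operation counts and durations in seconds
print(processing_speed([120, 150, 130], [2.0, 2.5, 1.5]))  # 400 ops / 6.0 s
```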

Veiga et al. [30] conducted a series of experiments on a multicore cluster setup to demonstrate performance results for Apache Hadoop, Spark, and Flink. Apache Spark and Flink turned out to be much more efficient execution platforms than Hadoop when running nonsort benchmarks. It was further noted that Spark showed better performance for operations such as WordCount and K-Means (CPU-bound in nature), while Flink achieved better results for the PageRank algorithm (memory-bound in nature). Mavridis and Karatza [31] experimentally compared performance statistics of Apache Hadoop and Spark on the Okeanos IaaS cloud platform. For each set of experiments, the necessary statistics related to execution time, working nodes, and dataset size were recorded. Spark performance was found to be optimal compared to Hadoop for most of the cases. Furthermore, Spark on the YARN platform showed suboptimal results compared to the case when it was executed in standalone mode. Similar results were also observed by Zaharia et al. [32] on a 100 GB dataset record. Vellaipandiyan and Raja [33] demonstrated performance evaluation and comparison of the Hadoop and Spark frameworks on residents' record datasets ranging from 100 GB to 900 GB in size. Spark's performance was relatively better when the dataset size was small to medium (100 GB-750 GB); afterwards, its performance declined compared to Hadoop. The primary reason for the performance decline was that the Spark cache could not fit into memory for the larger datasets. Taran et al. [34] quantified performance differences between Hadoop and Spark using a WordCount dataset ranging from 100 KB to 1 GB. It was observed that the Hadoop framework was five times faster than Spark when the evaluation was performed using a larger set of data sources. However, for the smaller tasks, Spark showed better performance results; the speed-up ratio decreased for both frameworks with the growth of the input dataset.

Gopalani and Arora [35] used the K-Means algorithm on small, medium, and large location datasets to compare the Hadoop and Spark frameworks. The study results showed that Spark performed up to three times better than MapReduce for most of the cases. Bertoni et al. [36] performed an experimental evaluation of Apache Flink and Storm using a large genomic dataset on the Amazon EC2 cloud. Apache Flink was superior to Storm for histogram and map operations, while Storm outperformed Flink when a genomic join application was deployed.

3.2. Fault Tolerance. Fault tolerance is the characteristic that enables a system to continue functioning in case of the failure of one or more components. High-performance computing applications involve hundreds of nodes that are interconnected to perform a specific task; the failure of a node should have zero or minimal effect on the overall computation. The tolerance of a system, represented as Tol_FT, to meet its requirements after a disruption is the ratio of the time to complete tasks without observing any fault events to the overall execution time in which some fault events were detected and the system state was reverted to a consistent state:

$$\mathrm{Tol}_{\mathrm{FT}} = \frac{T_x}{T_x + \sigma^2}, \quad (2)$$

where $T_x$ is the estimated correct execution time, obtained from a program run that is presumed to be fault-free or by averaging the execution time from several application runs that produce a known correct output, and $\sigma^2$ represents the variance in a program's execution time due to the occurrence of fault events. For an HPC application that consists of a set of computationally intensive tasks $\Gamma = \{\tau_1, \tau_2, \ldots, \tau_n\}$, with $\mathrm{Tol}_{\mathrm{FT}}$ computed for each individual task as $\mathrm{Tol}_{\tau_i}$, the overall application resilience Tol may be calculated as [37]

$$\mathrm{Tol} = \sqrt[n]{\mathrm{Tol}_{\tau_1} \cdot \mathrm{Tol}_{\tau_2} \cdots \mathrm{Tol}_{\tau_n}}. \quad (3)$$
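Equations (2) and (3) compose as follows: a per-task tolerance from fault-free time and fault-induced variance, then a geometric mean over all tasks. The sketch below uses made-up timing values for illustration.

```python
import math

def tol_ft(t_correct, variance):
    """Per-task fault tolerance (Eq. (2)): fault-free execution time
    T_x over T_x plus the execution-time variance due to faults."""
    return t_correct / (t_correct + variance)

def overall_resilience(tols):
    """Application resilience (Eq. (3)): geometric mean (n-th root of
    the product) of the per-task tolerance values."""
    return math.prod(tols) ** (1.0 / len(tols))

# Hypothetical tasks: (fault-free time, variance from fault events)
tasks = [tol_ft(100.0, 25.0), tol_ft(80.0, 20.0), tol_ft(60.0, 0.0)]
print(round(overall_resilience(tasks), 3))
```

A task untouched by faults (variance 0) contributes a tolerance of 1.0, so it does not drag the geometric mean down; a single badly affected task, by contrast, lowers the overall resilience noticeably.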

Lu et al. [38] used the StreamBench toolkit to evaluate the performance and fault tolerance ability of Apache Spark, Storm, and Samza. It was found that, with increased data size, Spark is much more stable and fault-tolerant than Storm, but it may be less efficient than Samza. Furthermore, when compared in terms of handling a large capacity of data, both Samza and Spark outperformed Storm. Gu and Li [39] used the PageRank algorithm to perform a comparative experiment on the Hadoop and Spark frameworks. It was observed that, for smaller datasets such as wiki-Vote and soc-Slashdot0902, Spark outperformed Hadoop by a wide margin. However, this speed-up degraded with the growth of the dataset, and for large datasets Hadoop easily outperformed Spark. Furthermore, for massively large datasets, Spark was reported to crash with a JVM heap exception, while Hadoop still completed its task. Lopez et al. [40] evaluated the throughput and fault tolerance mechanisms of Apache Storm and Flink. The experiments were based on a threat detection system, where Apache Storm demonstrated better throughput than Flink. For fault tolerance,


different virtual machines were manually turned off to analyze the impact of node failures. Apache Flink used its internal subsystem to detect and migrate the failed tasks to other machines and hence suffered very few message losses. On the other hand, Storm took more time, as Zookeeper, which involves some performance overhead, was responsible for reporting the state of nimbus and thereafter processing the failed task on other nodes.

3.3. Scalability. Scalability refers to the ability to accommodate large loads or changes in size/workload by provisioning resources at runtime. This can be further categorized as scale-up (making the hardware stronger) or scale-out (adding additional nodes). One of the critical requirements of enterprises is to process large volumes of data in a timely manner to address high-value business problems. Dynamic resource scalability allows business entities to perform massive computation in parallel, thus reducing overall time complexity and effort. The definition of scalability comes from Amdahl's and Gustafson's laws [41]. Let $W$ be the size of the workload before the improvement of the system resources, let $\alpha$ be the fraction of the execution workload that benefits from the improvement of system resources, and let $1 - \alpha$ be the fraction concerning the part that does not benefit from the improvement. When using an $n$-processor system, the user workload is scaled to $W'$:

$$W' = \alpha W + (1 - \alpha)\, n W. \quad (4)$$

The parallel execution time of a scaled workload on $n$ processors is defined as the scaled-workload speed-up $S$, as shown in

$$S = \frac{W'}{W} = \frac{\alpha W + (1 - \alpha)\, n W}{W} = \alpha + (1 - \alpha)\, n. \quad (5)$$
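Since the workload size $W$ cancels in (5), the scaled speed-up depends only on the serial fraction and the node count; the sketch below evaluates it for an illustrative configuration.

```python
def scaled_speedup(alpha, n):
    """Gustafson-style scaled-workload speed-up S (Eq. (5)):
    S = (alpha*W + (1 - alpha)*n*W) / W = alpha + (1 - alpha)*n,
    where alpha is the fraction that does NOT scale with n."""
    return alpha + (1.0 - alpha) * n

# With a 10% non-scaling fraction, 64 nodes yield a speed-up of about 57.7
print(scaled_speedup(0.1, 64))
```

Note the asymmetry with Amdahl's law: here the parallel part of the workload is allowed to grow with the cluster, which is exactly the regime big data frameworks target.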

García-Gil et al. [42] performed a scalability and performance comparison of Apache Spark and Flink using a feature selection framework that assembles multiple information-theoretic criteria into a single greedy algorithm. The ECBDL14 dataset was used to measure the scalability factor of the frameworks. It was observed that Spark's scalability performance was 4-10 times faster than Flink's. Jakovits and Srirama [43] analyzed four MapReduce-based frameworks, including Hadoop and Spark, benchmarking the Partitioning Around Medoids, Clustering Large Applications, and Conjugate Gradient linear system solver algorithms using MPI. All experiments were performed on a cloud of 33 Amazon EC2 large instances. For all algorithms, Spark performed much better than Hadoop in terms of both performance and scalability. Boden et al. [44] aimed to investigate scalability with respect to both data size and dimensionality in order to provide a fair and insightful benchmark that reflects the requirements of real-world machine learning applications. The benchmark comprised distributed optimization algorithms for supervised learning as well as algorithms for unsupervised learning. For supervised learning, they implemented machine learning algorithms using the Breeze library, while for unsupervised learning they chose the popular k-Means clustering algorithm in order to assess the scalability

of the two resource management frameworks. The overall execution time of Flink was relatively low in resource-constrained settings with a limited number of nodes, while Spark had a clear edge once enough main memory was available through the addition of new computing nodes.

3.4. Machine Learning and Iterative Tasks Support. Big data applications are inherently complex in nature and usually involve tasks and algorithms that are iterative. These applications have a distinctly cyclic nature, achieving the desired result by continually repeating a set of tasks until the remaining error cannot be substantially reduced further.

Spangenberg et al. [45] used real-world datasets with four algorithms, that is, WordCount, K-Means, PageRank, and relational query, to benchmark Apache Flink and Storm. It was observed that Apache Storm performs better in batch mode than Flink. However, with increasing complexity, Apache Flink had a performance advantage over Storm, and thus it is better suited for iterative data or graph processing. Shi et al. [46] focused on analyzing Apache MapReduce and Spark for batch and iterative jobs. For smaller datasets, Spark turned out to be the better choice, but when experiments were performed on larger datasets, MapReduce was several times faster than Spark. For iterative operations such as K-Means, Spark was 1.5 times faster than MapReduce in the first iteration, and more than 5 times faster in subsequent iterations.

Kang and Lee [47] examined five resource management frameworks, including Apache Hadoop and Spark, with respect to performance overheads (disk input/output, network communication, scheduling, etc.) in supporting iterative computation. The PageRank algorithm was used to evaluate these performance issues. Since processing static data tends to be a more frequent operation than processing dynamic data, as the static data is used in every iteration of MapReduce, it may cause significant performance overhead in the case of MapReduce. On the other hand, Apache Spark uses read-only cached versions of objects (resilient distributed datasets) which can be reused in parallel operations, thus reducing the performance overhead during iterative computation. Lee et al. [48] evaluated five systems, including Hadoop and Spark, over various workloads to compare four iterative algorithms. The experimentation was performed on the Amazon EC2 cloud. Overall, Spark showed the best performance when iterative operations were performed in main memory. In contrast, the performance of Hadoop was significantly poorer than that of the other resource management systems.
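The overhead pattern described above can be sketched in a few lines of pure Python (an illustration only, with made-up cost constants rather than measurements): an iterative job that re-reads its static input on every iteration pays the read cost n times, while a job that caches the input in memory pays it once.

```python
READ_COST = 10.0    # hypothetical cost of loading the static data once
COMPUTE_COST = 1.0  # hypothetical cost of one iteration's computation

def iterative_cost_no_cache(iterations):
    """MapReduce-style: the static data is re-read in every iteration."""
    return iterations * (READ_COST + COMPUTE_COST)

def iterative_cost_cached(iterations):
    """Spark-style: the static data is loaded once into a cached RDD
    and reused by all subsequent iterations."""
    return READ_COST + iterations * COMPUTE_COST

print(iterative_cost_no_cache(20))  # 220.0
print(iterative_cost_cached(20))    # 30.0
```

The gap widens linearly with the iteration count, which is why the PageRank-style workloads cited above favor frameworks with in-memory reuse.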

3.5. Latency. Big data and low latency are strongly linked. Big data applications provide true value to businesses, but they are mostly time-critical. If cloud computing is to be a successful platform for big data implementation, one of the key requirements will be the provisioning of high-speed networks to reduce communication latency. Furthermore, big data frameworks usually involve a centralized design in which the scheduler assigns all tasks through a single node, which may significantly impact latency when the size of the data is huge.


Let $T_{\text{elapsed}}$ be the elapsed time between the start and finish times of a program in a distributed architecture, let $T_i$ be the effective execution time, and let $\lambda_i$ be the sum of the total idle units of the $i$th processor from a set of $N$ processors. Then the average latency, represented as $\lambda(W, N)$ for a workload of size $W$, is defined as the average amount of overhead time needed for each processor to complete the task:

$$\lambda(W, N) = \frac{\sum_{i=1}^{N} \left(T_{\text{elapsed}} - T_i + \lambda_i\right)}{N}. \quad (6)$$

Chintapalli et al. [49] conducted a detailed analysis of the Apache Storm, Flink, and Spark streaming engines for latency and throughput. The study results indicated that, at high throughput, Flink and Storm have significantly lower latency than Spark. However, Spark was able to handle higher throughput than the other streaming engines. Lu et al. [38] proposed the StreamBench benchmark framework to evaluate modern distributed stream processing frameworks. The framework includes dataset selection, data generation methodologies, program set description, workload suite design, and metric proposition. Two real-world datasets, AOL Search Data and the CAIDA Anonymized Internet Traces Dataset, were used to assess the performance, scalability, and fault tolerance aspects of streaming frameworks. It was observed that Storm's latency in most cases was far less than Spark's, except when the scales of the workload and dataset were massive in nature.

3.6. Security. In contrast to the classical HPC environment, where information is stored in-house, many big data applications are now increasingly deployed on the cloud, where privacy-sensitive information may be accessed or recorded by different data users with ease. Although data privacy and security issues are not a new topic in the area of distributed computing, their importance is amplified by the wide adoption of cloud computing services for big data platforms. A dataset may be exposed to multiple users for different purposes, which may lead to security and privacy risks.

Let $N$ be the number of security categories that a security mechanism may provide. For instance, a framework may use an encryption mechanism to provide data security, and access control lists for authentication and authorization services. Let $\max(W_i)$ be the maximum weight that can be assigned to the $i$th security category from the list of $N$ categories, and let $W_i$ be the reputation score of a particular resource management framework for that category. Then the framework security ranking score can be represented as

$$\mathrm{Sec}_{\text{score}} = \frac{\sum_{i=1}^{N} W_i}{\sum_{i=1}^{N} \max(W_i)}. \quad (7)$$

Hadoop and Storm use the Kerberos authentication protocol for computing nodes to prove their identity [50]. Spark adopts a password-based shared secret configuration as well as Access Control Lists (ACLs) to control the authentication and authorization mechanisms. In Flink, stream brokers are responsible for providing the authentication mechanism across multiple services. Apache Samza/Kafka provides no built-in security at the system level.

3.7. Dataset Size Support. Many scientific applications scale up to hundreds of nodes to process massive amounts of data that may exceed hundreds of terabytes. Unless big data applications are properly optimized for larger datasets, growth of the data may cause performance degradation and, in many cases, a software crash, resulting in a loss of time and money. We used the same methodology to collect big data support statistics as presented in the earlier sections.

4. Discussion on Big Data Frameworks

Every big data framework has evolved from its unique design characteristics and application-specific requirements. Based on the key factors and the empirical evidence used to evaluate big data frameworks, as discussed in Section 3, we summarize the results for the seven evaluation factors in the form of a star ranking, $\mathrm{Ranking}_{\mathrm{RF}} \in [0, s]$, where $s$ represents the maximum evaluation score. The general equation to produce the ranking for each resource framework (RF) is given as

$$\mathrm{Ranking}_{\mathrm{RF}} = \left\| \frac{s}{\sum_{i=1}^{7} \sum_{j=1}^{N} \max(W_{ij})} \ast \sum_{i=1}^{7} \sum_{j=1}^{N} W_{ij} \right\|, \quad (8)$$

where $N$ is the total number of research studies for a particular set of evaluation metrics, $\max(W_{ij}) \in [0, 1]$ is the relative normalized weight assigned to each literature study based on the number of experiments performed, and $W_{ij}$ is the framework test-bed score calculated from the experimentation results of each study.
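A minimal sketch of how Equation (8) turns normalized study scores into a star rating follows; the input matrices are invented for illustration, and interpreting the double-bar notation as rounding to the nearest whole star is our assumption.

```python
def framework_ranking(scores, max_weights, s=5):
    """Star ranking (Eq. (8)): test-bed scores W_ij summed over the
    seven factors and N studies, normalized by the maximum attainable
    weights and scaled to [0, s]. Rounding to whole stars is an
    assumption for the ||.|| notation in the source."""
    total = sum(sum(row) for row in scores)
    max_total = sum(sum(row) for row in max_weights)
    return round(s * total / max_total)

# Hypothetical: 2 evaluation factors, 2 studies each, unit max weights
scores = [[0.9, 0.7], [0.8, 0.6]]
max_w = [[1.0, 1.0], [1.0, 1.0]]
print(framework_ranking(scores, max_w))  # 4
```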

Hadoop MapReduce has a clear edge for large-scale deployment and larger dataset processing. Hadoop is highly compatible and interoperable with other frameworks. It also offers a reliable fault tolerance mechanism, providing failure-free operation over a long period of time, and it can operate on low-cost configurations. However, Hadoop is not suitable for real-time applications. It has a significant disadvantage when latency, throughput, and iterative job support for machine learning are the key considerations of the application requirements.

Apache Spark is designed to be a replacement for the batch-oriented Hadoop ecosystem that runs over both static and real-time datasets. It is highly suitable for high-throughput streaming applications where latency is not a major issue. Spark is memory intensive and all operations take place in memory. As a result, it may crash if enough memory is not available for further operations (before the release of Spark version 1.5, it was not capable of handling datasets larger than the size of RAM, and the problem of handling larger datasets still persists in the newer releases with different performance overheads). A few research efforts, such as Project Tungsten, are aimed at improving the memory and CPU efficiency of Spark applications. Spark also lacks its own storage system, so its integration with HDFS through YARN, or with Cassandra using Mesos, is an extra overhead for cluster configuration.

Apache Flink is a true streaming engine. Flink supports both batch and real-time operations over a common runtime to fulfill the requirements of the Lambda architecture. However,


it may also work in batch mode by stopping the streaming source. Like Spark, Flink performs all operations in memory, but in case of a memory hog it may also use disk storage to avoid application failure. Flink has some major advantages over Hadoop and Spark, providing better support for iterative processing with high throughput at the cost of low latency.

Apache Storm was designed to provide a scalable, fault-tolerant, real-time streaming engine for data analysis, much as Hadoop did for batch processing. However, the empirical evidence suggests that Apache Storm proved to be inefficient in meeting the scale-up/scale-down requirements of real-time big data applications. Furthermore, since it uses microbatch stream processing, it is not very efficient where continuous stream processing is a major concern, nor does it provide a mechanism for simple batch processing. For fault tolerance, Storm uses Zookeeper to store the state of the processes, which may involve some extra overhead and may also result in message loss. On the other hand, Storm is an ideal solution for near-real-time application processing, where the workload can be processed with minimal delay under strict latency requirements.

Apache Samza, in integration with Kafka, provides some unique features that are not offered by other stream processing engines. Samza provides a powerful checkpointing-based fault tolerance mechanism with minimal data loss. Samza jobs can have high throughput with low latency when integrated with Kafka. However, Samza lacks some important features as a data processing engine, and it offers no built-in security mechanism for data access control.

To categorize the selection of the best resource engine based on a particular set of requirements, we use the framework proposed by Chung et al. [51]. The framework provides a layout for matching, ranking, and selecting a system based on particular requirements. The matching criterion is based on user goals, which are categorized as soft and hard goals. Soft goals represent the nonfunctional requirements of the system (such as security, fault tolerance, and scalability), while hard goals represent the functional aspects of the system (such as machine learning and data size support). The relationships between the system components and the goals can be ranked as very positive (++), positive (+), negative (-), and very negative (--). Based on the evidence provided in the literature, the categorization of the major resource management frameworks is presented in Figure 7.

5. Related Work

Our research work differs from other efforts because the subject, goal, and object of study are not identical, as we provide an in-depth comparison of popular resource engines based on empirical evidence from the existing literature. Hesse and Lorenz [8] conducted a conceptual survey on stream processing systems. However, their discussion was focused on some basic differences related to real-time data processing engines. Singh and Reddy [9] provided a thorough analysis of big data analytic platforms that included peer-to-peer networks, field programmable gate arrays (FPGA), the Apache Hadoop ecosystem, high-performance computing (HPC) clusters, multicore CPUs, and graphics processing units (GPUs). Our case is different here, as we are particularly interested in big data processing engines. Finally, Landset et al. [10] focused on machine learning libraries and their evaluation based on ease of use, scalability, and extensibility. However, our work differs from their work, as the primary focuses of both studies are not identical.

Chen and Zhang [11] discussed big data problems, challenges, and associated techniques and technologies to address these issues. Several potential techniques, including cloud computing, quantum computing, granular computing, and biological computing, were investigated, and the possible opportunities to explore these domains were demonstrated. However, the performance evaluation was discussed only on theoretical grounds. A taxonomy and detailed analysis of the state of the art in big data 2.0 processing systems were presented in [12]. The focus of the study was to identify current research challenges and highlight opportunities for new innovations and optimization for future research and development. Assuncao et al. [13] reviewed multiple generations of data stream processing frameworks that provide mechanisms for resource elasticity to match the demands of stream processing services. The study examined the challenges associated with efficient resource management decisions and suggested solutions derived from the existing research studies. However, the study metrics are restricted to the elasticity/scalability aspect of big data streaming frameworks.

As shown in Table 3, our work differs from the previous studies, which focused on the classification of resource management frameworks on theoretical grounds. In contrast to the earlier approaches, we classify and categorize big data resource management frameworks on empirical grounds derived from multiple evaluation/experimentation studies. Furthermore, our evaluation/ranking methodology is based on a comprehensive list of study variables which were not addressed in the studies conducted earlier.

6. Conclusions and Future Work

There are a number of disruptive and transformative big data technologies and solutions that are rapidly emerging and evolving in order to provide data-driven insight and innovation. The primary objective of this study was to classify popular big data resource management systems. This study also aimed at addressing the selection of a candidate resource provider based on specific big data application requirements. We surveyed different big data resource management frameworks and investigated the advantages and disadvantages of each of them. We carried out the performance evaluation of resource management engines based on seven key factors, and each one of the frameworks was ranked based on the empirical evidence from the literature.

6.1. Observations and Findings. Some key findings of the study are as follows:

Scientific Programming 13

Table 3: Comparison and application areas of related research studies.

(columns: study reference | data model | resource frameworks | study features | evaluation/ranking methodology)
[8] | Data stream processing systems | Storm, Flink, Spark, Samza | A brief comparison of resource frameworks | N
[9] | Batch and stream processing systems | Horizontal scaling systems such as peer-to-peer, MapReduce, MPI, and Spark; vertical scaling systems such as CUDA and HDL | Comparison of horizontal and vertical scaling systems | Theoretical comparison of resource frameworks
[10] | Batch and stream processing engines | MapReduce, Spark, Flink, and Storm, as well as machine learning libraries | Machine learning libraries and their evaluation mechanism | Performance comparison with respect to machine learning toolkits
[11] | Batch and stream processing frameworks | Hadoop, Storm, and other big data frameworks | In-depth analysis of big data opportunities and challenges | N
[12] | Batch and stream processing frameworks | Hadoop, Spark, Storm, Flink, and Tez, as well as SQL, graph, and bulk synchronous parallel models | Analysis of current open research challenges in the field of big data and the promising directions for future research | N
[13] | Stream processing engines | Apache Storm, S4, Flink, Samza, Spark Streaming, and Twitter Heron | Classification of elasticity metrics for resource allocation strategies that meet the demands of stream processing services | Evaluation of elasticity scaling metrics for stream processing systems

Figure 7: Categorization of resource engines based on key big data application requirements. The figure decomposes soft (nonfunctional) goals such as fault tolerance (recoverability), scalability, performance, security (access control, authentication), and latency, and hard (functional) goals such as machine learning (iterative) and data size (large, medium), and rates Hadoop, Spark, Flink, Storm, and Samza against these subgoals with + and ++ contributions.

(i) In terms of processing speed, Apache Flink outperforms other resource management frameworks for small, medium, and large datasets [30, 36]. However, during our own set of experiments on an Amazon EC2 cluster with varied task manager settings (1–4 task managers per node), Flink failed to complete custom smaller-size JVM dataset jobs due to inefficient memory management by the Flink memory manager. We could not find any reported evidence of this particular use case in the relevant literature. It seems that most of the performance evaluation studies employed standard benchmarking test sets, where the dataset size was relatively large, and hence this particular use case was not reported in these studies. Further research effort is required to elucidate the specific underlying factors in this particular case.

(ii) Big data applications usually involve massive amounts of data. Apache Spark supports the necessary strategies for its fault tolerance mechanism, but it has been reported to crash on larger datasets. Even during our experimentations, Apache Spark version 1.6 (selected for compatibility with earlier research) crashed on several occasions when the dataset was larger than 500 GB. Although Spark has been ranked higher in terms of fault tolerance with the increase of data scale in several studies [38, 52], it has limitations in handling typical larger big data applications, and hence such studies cannot be generalized.

(iii) Spark MLlib and Flink-ML offer a variety of machine learning algorithms and utilities to exploit distributed and scalable big data applications. Spark MLlib outperforms Flink-ML in most of the machine learning use cases [42], except in the case when repeated passes are performed on unchanged data. However, this performance evaluation may be investigated further, as research studies such as [53] have reported differently, with Flink outperforming Spark on sufficiently large cases.

(iv) For graph processing algorithms such as PageRank, Flink uses the Gelly library, which provides native closed-loop iteration operators, making it a suitable platform for large-scale graph analytics. Spark, on the other hand, uses the GraphX library, which has a much longer preprocessing time to build graphs and other data structures, making its performance worse as compared to Apache Flink. Apache Flink has been reported to obtain the best results for graph processing datasets (3x–5x in [54] and 2x–3x in [45]) as compared to Spark. However, some studies, such as [55], reported Spark to be 1.7x faster than Flink for large graph processing. Such inconsistent behavior may be further investigated in future research studies.

6.2. Future Work. In the area of big data, there is still a clear gap that requires more effort from the research community to build an in-depth understanding of the performance characteristics of big data resource management frameworks. We consider this study a step towards enlarging our knowledge of the big data world and an effort towards improving the state of the art and achieving the big vision in the big data domain. In earlier studies, a clear ranking could not be established, as the study parameters were mostly limited to a few issues such as throughput, latency, and machine learning. Furthermore, additional investigation is required on resource engines such as Apache Samza in comparison with other frameworks. In addition, research effort needs to be carried out in several areas, such as data organization, platform-specific tools, and technological issues in the big data domain, in order to create next-generation big data infrastructures. The performance evaluation factors might also vary among systems depending on the algorithms used. As future work, we plan to benchmark these popular resource engines for meeting the resource demands and requirements of different scientific applications. Moreover, a scalability analysis could be done; in particular, the performance evaluation of adding dynamic worker nodes and the resulting performance analysis is of special interest. Additionally, further research can be carried out in order to evaluate performance aspects with respect to resource competition between jobs (on different resource schedulers such as YARN and Mesos) and the fluctuation of available computing resources. Finally, most of the experimentations in earlier studies were performed using standard parameter configurations; however, each resource management framework offers domain-specific tweaks and configuration optimization mechanisms for meeting application-specific requirements. The development of a benchmark suite that aims to find maximum throughput based on configuration optimization would be an interesting direction of future research.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] "Big Data in the Cloud: Converging Technologies," https://www.intel.com/content/dam/www/public/emea/de/de/documents/product-briefs/big-data-cloud-technologies-brief.pdf.
[2] Q. He, J. Yan, R. Kowalczyk, H. Jin, and Y. Yang, "Lifetime service level agreement management with autonomous agents for services provision," Information Sciences, vol. 179, no. 15, pp. 2591–2605, 2009.
[3] O. Rana, M. Warnier, T. B. Quillinan, and F. Brazier, "Monitoring and reputation mechanisms for service level agreements," in 5th International Workshop on Grid Economics and Business Models, GECON '08, vol. 5206 of Lecture Notes in Computer Science, pp. 125–139, 2008.
[4] N. Yuhanna and M. Gualtieri, "The Forrester Wave: Big Data Hadoop Cloud Solutions, Q2 2016: Elasticity, Automation, and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud," https://ncmedia.azureedge.net/ncmedia/2016/05/The_Forrester_Wave_Big_D.pdf.
[5] R. Arora, "An introduction to big data, high performance computing, high-throughput computing, and Hadoop," in Conquering Big Data with High Performance Computing, pp. 1–12, 2016.
[6] F. Pop, J. Kołodziej, and B. D. Martino, Resource Management for Big Data Platforms: Algorithms, Modelling, and High-Performance Computing Techniques, Springer, 2016.
[7] M. Gribaudo, M. Iacono, and F. Palmieri, "Performance modeling of big data-oriented architectures," in Resource Management for Big Data Platforms, Computer Communications and Networks, pp. 3–34, Springer International Publishing, Cham, 2016.
[8] G. Hesse and M. Lorenz, "Conceptual survey on data stream processing systems," in Proceedings of the 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), pp. 797–802, Melbourne, Australia, December 2015.
[9] D. Singh and C. K. Reddy, "A survey on platforms for big data analytics," Journal of Big Data, vol. 2, Article ID 8, 2014.
[10] S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin, "A survey of open source tools for machine learning with big data in the Hadoop ecosystem," Journal of Big Data, vol. 2, no. 1, Article ID 24, 2015.
[11] C. L. P. Chen and C. Y. Zhang, "Data-intensive applications, challenges, techniques and technologies: a survey on Big Data," Information Sciences, vol. 275, pp. 314–347, 2014.
[12] F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, and S. Sakr, "Big data 2.0 processing systems: taxonomy and open challenges," Journal of Grid Computing, vol. 14, no. 3, pp. 379–405, 2016.
[13] M. D. d. Assuncao, A. d. S. Veith, and R. Buyya, "Distributed data stream processing and edge computing: a survey on resource elasticity and future directions," Journal of Network and Computer Applications, vol. 103, pp. 1–17, 2018.
[14] "Big data working group: big data taxonomy, 2014."
[15] W. Inoubli, S. Aridhi, H. Mezni, M. Maddouri, and E. M. Nguifo, "An experimental survey on big data frameworks," Future Generation Computer Systems, 2018.
[16] H.-K. Kwon and K.-K. Seo, "A fuzzy AHP based multi-criteria decision-making model to select a cloud service," International Journal of Smart Home, vol. 8, no. 3, pp. 175–180, 2014.
[17] A. Rezgui and S. Rezgui, "A stochastic approach for virtual machine placement in volunteer cloud federations," in Proceedings of the 2nd IEEE International Conference on Cloud Engineering, IC2E 2014, pp. 277–282, Boston, MA, USA, March 2014.
[18] M. Salama, A. Zeid, A. Shawish, and X. Jiang, "A novel QoS-based framework for cloud computing service provider selection," International Journal of Cloud Applications and Computing, vol. 4, no. 2, pp. 48–72, 2014.
[19] G. Mateescu, W. Gentzsch, and C. J. Ribbens, "Hybrid computing - where HPC meets grid and cloud computing," Future Generation Computer Systems, vol. 27, no. 5, pp. 440–453, 2011.
[20] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: enabling scalable virtual organizations," International Journal of High Performance Computing Applications, vol. 15, no. 3, pp. 200–222, 2001.
[21] S. Benedict, "Performance issues and performance analysis tools for HPC cloud applications: a survey," Computing, vol. 95, no. 2, pp. 89–108, 2013.
[22] E. C. Inacio and M. A. R. Dantas, "A survey into performance and energy efficiency in HPC, cloud and big data environments," International Journal of Networking and Virtual Organisations, vol. 14, no. 4, pp. 299–318, 2014.
[23] N. Yuhanna and M. Gualtieri, "Elasticity, Automation and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud, Q2 2016."
[24] S. Ullah, M. D. Awan, and M. S. Khiyal, "A price-performance analysis of EC2, Google Compute and Rackspace cloud providers for scientific computing," Journal of Mathematics and Computer Science, vol. 16, no. 2, pp. 178–192, 2016.
[25] J. Hurwitz, A. Nugent, F. Halper, and M. Kaufman, Big Data for Dummies, John Wiley & Sons, 2013.
[26] S. K. Garg, S. Versteeg, and R. Buyya, "A framework for ranking of cloud computing services," Future Generation Computer Systems, vol. 29, no. 4, pp. 1012–1023, 2013.
[27] D. Wu, S. Sakr, and L. Zhu, "Big data programming models," in Handbook of Big Data Technologies, pp. 31–63, 2017.
[28] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, vol. 1, no. 1, 2016.
[29] A. Li, X. Yang, S. Kandula, and M. Zhang, "CloudCmp: comparing public cloud providers," in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10), pp. 1–14, ACM, Melbourne, Australia, November 2010.
[30] J. Veiga, R. R. Exposito, X. C. Pardo, G. L. Taboada, and J. Touriño, "Performance evaluation of big data frameworks for large-scale data analytics," in Proceedings of the 4th IEEE International Conference on Big Data, Big Data 2016, pp. 424–431, December 2016.
[31] I. Mavridis and H. Karatza, "Log file analysis in cloud with Apache Hadoop and Apache Spark," in Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015), Poland, 2015.
[32] M. Zaharia, M. Chowdhury, and T. Das, "Fast and interactive analytics over Hadoop data with Spark," USENIX Login, vol. 37, no. 4, pp. 45–51, 2012.
[33] S. Vellaipandiyan and P. V. Raja, "Performance evaluation of distributed framework over YARN cluster manager," in Proceedings of the 2016 IEEE International Conference on Computational Intelligence and Computing Research, ICCIC 2016, December 2016.
[34] V. Taran, O. Alienin, S. Stirenko, Y. Gordienko, and A. Rojbi, "Performance evaluation of distributed computing environments with Hadoop and Spark frameworks," in Proceedings of the 2017 IEEE International Young Scientists' Forum on Applied Physics and Engineering (YSF), pp. 80–83, Lviv, Ukraine, October 2017.
[35] S. Gopalani and R. Arora, "Comparing Apache Spark and Map Reduce with performance analysis using K-means," International Journal of Computer Applications, vol. 113, no. 1, pp. 8–11, 2015.
[36] M. Bertoni, S. Ceri, A. Kaitoua, and P. Pinoli, "Evaluating cloud frameworks on genomic applications," in Proceedings of the 3rd IEEE International Conference on Big Data, IEEE Big Data 2015, pp. 193–202, November 2015.
[37] S. Hukerikar, R. A. Ashraf, and C. Engelmann, "Towards new metrics for high-performance computing resilience," in Proceedings of the 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017, pp. 23–30, June 2017.
[38] R. Lu, G. Wu, B. Xie, and J. Hu, "Stream bench: towards benchmarking modern distributed stream computing frameworks," in Proceedings of the 7th IEEE/ACM International Conference on Utility and Cloud Computing, UCC 2014, pp. 69–78, December 2014.
[39] L. Gu and H. Li, "Memory or time: performance evaluation for iterative operation on Hadoop and Spark," in Proceedings of the 15th IEEE International Conference on High Performance Computing and Communications, HPCC 2013, and 11th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, EUC 2013, pp. 721–727, November 2013.
[40] M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte, "A performance comparison of open-source stream processing platforms," in Proceedings of the 59th IEEE Global Communications Conference, GLOBECOM 2016, December 2016.
[41] J. L. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, vol. 31, no. 5, pp. 532-533, 1988.
[42] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, "A comparison on scalability for batch big data processing on Apache Spark and Apache Flink," Big Data Analytics, vol. 2, no. 1, 2017.
[43] P. Jakovits and S. N. Srirama, "Evaluating MapReduce frameworks for iterative scientific computing applications," in Proceedings of the 2014 International Conference on High Performance Computing and Simulation, HPCS 2014, pp. 226–233, July 2014.
[44] C. Boden, A. Spina, T. Rabl, and V. Markl, "Benchmarking data flow systems for scalable machine learning," in Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2017, May 2017.
[45] N. Spangenberg, M. Roth, and B. Franczyk, Evaluating New Approaches of Big Data Analytics Frameworks, vol. 208 of Lecture Notes in Business Information Processing, Springer, Cham, 2015.
[46] J. Shi, Y. Qiu, U. F. Minhas et al., "Clash of the titans: MapReduce vs. Spark for large scale data analytics," in Proceedings of the 41st International Conference on Very Large Data Bases, pp. 2110–2121, Kohala Coast, HI, USA, 2015.
[47] M. Kang and J. Lee, "A comparative analysis of iterative MapReduce systems," in Proceedings of the Sixth International Conference, pp. 61–64, Jeju, Republic of Korea, October 2016.
[48] H. Lee, M. Kang, S.-B. Youn, J.-G. Lee, and Y. Kwon, "An experimental comparison of iterative MapReduce frameworks," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, pp. 2089–2094, October 2016.
[49] S. Chintapalli, D. Dagit, B. Evans et al., "Benchmarking streaming computation engines: Storm, Flink and Spark streaming," in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016, pp. 1789–1792, May 2016.
[50] X. Zhang, C. Liu, S. Nepal, W. Dou, and J. Chen, "Privacy-preserving layer over MapReduce on cloud," in Proceedings of the 2nd International Conference on Cloud and Green Computing, CGC 2012, held jointly with the 2nd International Conference on Social Computing and Its Applications, SCA 2012, pp. 304–310, November 2012.
[51] L. Chung, B. A. Nixon, and E. Yu, "Using nonfunctional requirements to systematically select among alternatives in architectural design," in Proceedings of the 1st International Workshop on Architectures for Software Systems, pp. 31–43, 1995.
[52] S. Qian, G. Wu, J. Huang, and T. Das, "Benchmarking modern distributed streaming platforms," in Proceedings of the 2016 IEEE International Conference on Industrial Technology (ICIT), pp. 592–598, Taipei, Taiwan, March 2016.
[53] F. Dambreville and A. Toumi, "Generic and massively concurrent computation of belief combination rules," in Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, pp. 1–6, Blagoevgrad, Bulgaria, November 2016.
[54] R. W. Techentin, M. W. Markland, R. J. Poole, D. R. H., C. R. Haider, and B. K. Gilbert, "PageRank performance evaluation of cluster computing frameworks on Cray Urika-GX supercomputer," Computer Science and Engineering, vol. 6, pp. 33–38, 2016.
[55] O.-C. Marcu, A. Costan, G. Antoniu, and M. S. Perez-Hernandez, "Spark versus Flink: understanding performance in big data analytics frameworks," in Proceedings of the 2016 IEEE International Conference on Cluster Computing, CLUSTER 2016, pp. 433–442, Taipei, Taiwan, September 2016.



Figure 2: Hadoop 2.0 ecosystem (source: [14]). The figure shows the Hadoop stack running above the hardware, hypervisor, OS, and JVM layers: HDFS (distributed file system) and YARN (cluster resource management) at the base, MapReduce (distributed data processing) and TEZ (execution engine) above them, and supporting components such as HBase (column-oriented NoSQL database), Pig (data flow language), Hive (SQL queries), Lucene (search), Oozie (workflow scheduling), and Zookeeper (coordination).

Figure 3: Spark architecture (source: [15]). The figure shows a driver program (holding the Spark context) that coordinates worker nodes through a cluster manager (Spark standalone, YARN, or Mesos), the Spark libraries (MLlib, Spark SQL, Spark Streaming, and GraphX), and data sources such as HDFS and a DBMS.

In Figure 3, we present the general Spark system architecture.

2.3. Flink. Apache Flink is an emerging competitor of Spark which offers functional programming interfaces much similar to Spark. It shares many programming primitives and transformations in the same way as what Spark does for iterative development, predictive analysis, and graph stream processing. Flink was developed to fill the gap left by Spark, which uses minibatch streaming processing instead of a pure streaming approach. Flink ensures high processing performance when dealing with complex big data structures such as graphs. Flink programs are regular applications which are written with a rich set of transformation operations (such as mapping, filtering, grouping, aggregating, and joining) applied to the input datasets. The Flink dataset uses a table-based model; therefore, application developers can use index numbers to specify a particular field of a dataset [27, 28].
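The transformation chaining described above can be mimicked in plain Python with no Flink dependency. The pipeline below is only a sketch of the map/filter/group/aggregate style; the function and variable names are ours, not the Flink API.

```python
from collections import defaultdict

def run_pipeline(records):
    """Toy map -> filter -> group/aggregate chain over "word,count" records."""
    # map: parse each record into a (key, value) pair
    pairs = (line.split(",") for line in records)
    typed = ((word, int(n)) for word, n in pairs)
    # filter: keep only positive counts
    kept = ((w, n) for w, n in typed if n > 0)
    # group + aggregate: sum values per key
    totals = defaultdict(int)
    for w, n in kept:
        totals[w] += n
    return dict(totals)

print(run_pipeline(["a,2", "b,-1", "a,3", "c,1"]))  # {'a': 5, 'c': 1}
```

In a real Flink program, each of these steps would be a distributed operator on a dataset or data stream rather than an in-memory generator.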

Flink is able to achieve high throughput and a low latency, thereby processing a bundle of data very quickly. Flink is designed to run on large-scale clusters with thousands of nodes, and in addition to a standalone cluster mode, Flink provides support for YARN. For a distributed environment, Flink chains operator subtasks together into tasks. Each task is executed by one thread [16]. The Flink runtime consists of two types of processes. There is at least one JobManager (also called master), which coordinates the distributed execution: it schedules tasks, coordinates checkpoints, and coordinates recovery on failures. A high-availability setup may involve multiple JobManagers, one of which is always the leader while the others are standby. The TaskManagers (also called workers) execute the tasks (or, more specifically, the subtasks) of a dataflow, and buffer and exchange the data streams. There must always be at least one TaskManager. The JobManagers and TaskManagers can be started in various ways: directly on the machines as a standalone cluster, in containers, or managed by resource frameworks like YARN or Mesos. TaskManagers connect to JobManagers, announcing themselves as available, and are assigned work. Figure 4 demonstrates the main components of the Flink framework.
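The slot-based placement just described can be illustrated with a small, framework-free sketch. The greedy policy and all names here are our own simplification, not Flink's actual scheduler.

```python
def assign_subtasks(subtasks, taskmanagers, slots_per_tm):
    """Greedily place subtasks into free slots; returns {subtask: (tm, slot)}.

    Each TaskManager advertises a fixed number of task slots; a job's
    subtasks are placed into free slots across the available workers.
    """
    free = [(tm, slot) for tm in taskmanagers for slot in range(slots_per_tm)]
    if len(subtasks) > len(free):
        raise RuntimeError("not enough task slots for the requested parallelism")
    return dict(zip(subtasks, free))

placement = assign_subtasks(
    ["map-0", "map-1", "reduce-0", "reduce-1"],
    taskmanagers=["tm-1", "tm-2"], slots_per_tm=2)
print(placement["reduce-1"])  # ('tm-2', 1)
```

The real JobManager also accounts for slot sharing, locality, and failures, but the core bookkeeping is of this shape: a pool of slots offered by workers, consumed by the subtasks of a dataflow.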

2.4. Storm. Storm [17] is a free, open-source, distributed stream processing computation framework. It takes several characteristics from the popular actor model and can be used with practically any kind of programming language for developing applications such as real-time streaming analytics, critical work flow systems, and data delivery services. The engine may process billions of tuples each day in a fault-tolerant way. It can be integrated with popular resource management frameworks such as YARN, Mesos, and Docker. An Apache Storm cluster is made up of two types of processing actors: spouts and bolts.

(i) A spout is connected to the external data source of a stream and continuously emits or collects new data for further processing.

(ii) A bolt is a processing logic unit within a stream processing topology; each bolt is responsible for a


Figure 4: Architectural components of Flink (source: [16]). The figure shows the Flink client (program code, optimizer/graph builder) submitting a dataflow graph to the JobManager (master/YARN application master), whose actor system, scheduler, and checkpoint coordinator deploy, stop, and cancel tasks on the TaskManagers (workers) and trigger checkpoints. Each TaskManager provides task slots, a memory and I/O manager, a network manager, and an actor system; TaskManagers exchange data streams with one another and report task status, heartbeats, and statistics back to the JobManager.

certain processing task, such as transformation, filtering, aggregating, or partitioning.

Storm defines workflows as directed acyclic graphs (DAGs), called topologies, with connected spouts and bolts as vertices. Edges in the graph define the links between the bolts and the data streams. Unlike batch jobs, which are only executed once, Storm jobs run forever until they are killed. There are two types of nodes in a Storm cluster: nimbus (master node) and supervisor (worker node). Nimbus, similar to the Hadoop JobTracker, is the core component of Apache Storm and is responsible for distributing load across the cluster, queuing and assigning tasks to different processing units, and monitoring execution status. Each worker node executes a process known as the supervisor, which may have one or more worker processes. The supervisor delegates the tasks to worker processes; a worker process then creates a subset of the topology to run the task. Apache Storm relies on an internal distributed messaging system, called Netty, for the communication between nimbus and supervisors. Zookeeper manages the communication between real-time job trackers (nimbus) and supervisors (Storm workers). Figure 5 outlines the high-level view of a Storm cluster.
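The spout/bolt pipeline described above can be imitated with ordinary Python generators. This is a toy single-process analogy (all names are ours), not the Storm API; in Storm, each stage would run as parallel tasks across worker processes.

```python
def sentence_spout():
    """Spout: emits raw tuples; a finite stream here for demonstration."""
    yield from ["to be", "or not", "to be"]

def split_bolt(stream):
    """Bolt: transformation step splitting sentences into words."""
    for sentence in stream:
        yield from sentence.split()

def count_bolt(stream):
    """Bolt: aggregation step counting word occurrences."""
    counts = {}
    for word in stream:
        counts[word] = counts.get(word, 0) + 1
    return counts

# Wiring the topology: spout -> split bolt -> count bolt.
result = count_bolt(split_bolt(sentence_spout()))
print(result)  # {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```

The edges of a real topology additionally specify stream groupings (for example, fields grouping so that the same word always reaches the same counting task).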

2.5. Apache Samza. Apache Samza [18] is a distributed stream processing framework mainly written in Scala and Java. Overall, it has a relatively high throughput as well as somewhat increased latency when compared to Storm [8]. It uses Apache Kafka, originally developed at LinkedIn, for messaging and streaming, while Apache Hadoop YARN or Mesos is utilized as an execution platform for overall resource management. Samza relies on Kafka's semantics to define the way streams are handled. Its main objective is to collect and deliver massively large volumes of event data, in particular log data, with low latency. A Kafka system's architecture is comparatively simple, as it only consists of a set of brokers, which are individual nodes that make up a Kafka cluster. Data streams are defined by topics; a topic is a stream of related information that consumers can subscribe to. Topics are divided into partitions that are distributed over the broker instances for retrieving the corresponding messages using a pull mechanism. The basic flow of job execution is presented in Figure 6.
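The topic/partition/pull model described above can be reduced to a small sketch. This is our own simplified model, not the Kafka or Samza API: the toy partitioner sums key bytes, whereas real Kafka hashes the key, and real consumers commit offsets to the broker.

```python
class Topic:
    """One Kafka-style topic: an append-only log per partition."""
    def __init__(self, n_partitions):
        self.partitions = [[] for _ in range(n_partitions)]

    def produce(self, key, message):
        # Toy partitioner: the same key always maps to the same partition,
        # preserving per-key ordering.
        self.partitions[sum(key.encode()) % len(self.partitions)].append(message)

class Consumer:
    """Pulls from a single partition at its own pace, tracking an offset."""
    def __init__(self, topic, partition):
        self.log = topic.partitions[partition]
        self.offset = 0

    def pull(self, max_messages=10):
        batch = self.log[self.offset:self.offset + max_messages]
        self.offset += len(batch)
        return batch

t = Topic(n_partitions=2)
for i in range(5):
    t.produce("clicks", f"event-{i}")
c = Consumer(t, partition=sum(b"clicks") % 2)
print(c.pull(3))  # ['event-0', 'event-1', 'event-2']
print(c.pull(3))  # ['event-3', 'event-4']
```

The pull-based offset is what lets a Samza task reprocess a partition from a checkpoint after a failure: restarting simply means resuming from the last committed offset.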

Tables 1 and 2 present a brief comparative analysis of these frameworks based on some common attributes. As shown in the tables, the MapReduce computation data flow follows a chain of stages with no loop. At each stage, the program


Table 1: Comparison of big data frameworks.

(columns: attribute | Hadoop | Spark | Storm | Samza | Flink)
Current stable version | 2.8.1 | 2.2.0 | 1.1.1 | 0.13.0 | 1.3.2
Batch processing | Yes | Yes | Yes | No | Yes
Computational model | MapReduce | Streaming (microbatches) | Streaming (microbatches) | Streaming | Supports continuous flow streaming as well as microbatch and batch
Data flow | Chain of stages | Directed acyclic graph | Directed acyclic graphs (DAGs) with spouts and bolts | Streams (acyclic graph) | Controlled cyclic dependency graph through machine learning
Resource management | YARN | YARN, Mesos | HDFS (YARN), Mesos | YARN, Mesos | Zookeeper, YARN, Mesos
Language support | All major languages | Java, Scala, Python, and R | Any programming language | JVM languages | Java, Scala, Python, and R
Job management & optimization | MapReduce approach | Catalyst extension | Storm-YARN, 3rd-party tools like Ganglia | Internal JobRunner | Internal optimizer
Interactive mode | None (3rd-party tools like Impala can be integrated) | Interactive shell | None | Limited API of Kafka streams | Scala shell
Machine learning libraries | Apache Mahout, H2O | SparkML and MLlib | Trident-ML, Apache SAMOA | Apache SAMOA | Flink-ML
Maximum reported nodes (scalability) | Yahoo Hadoop cluster with 42,000 nodes | 8,000 | 300 | LinkedIn with around hundred-node clusters | Alibaba customized Flink cluster with thousands of nodes


Table 2: Comparative analysis of big data resource frameworks (star ratings out of s = 5; N = none).

Criterion | Hadoop | Spark | Flink | Storm | Samza
Processing speed | 3 | 4 | 5 | 4 | 4
Fault tolerance | 4 | 2 | 4 | 3 | 4
Scalability | 5 | 4 | 3 | 3 | 3
Machine learning | 2 | 5 | 4 | 3 | 4
Low latency | 2 | 3 | 3 | 4 | 4
Security | 4 | 5 | 4 | 4 | N
Dataset size | 5 | 3 | 4 | 3 | 3

[Figure 5: Architecture of Storm cluster (source: [17]) — a master node running Nimbus coordinates worker nodes through the Zookeeper framework; each worker node hosts a supervisor and worker processes, whose executors run the individual tasks.]

[Figure 6: Samza architecture (source: [18]) — a Samza job consumes Kafka topics; YARN NodeManagers host Samza containers on each machine, and each container runs the job's tasks.]

proceeds with the output from the previous stage and generates an input for the next stage. Although machine learning algorithms are mostly designed in the form of cyclic data flow, Spark, Storm, and Samza represent it as a directed acyclic graph to optimize the execution plan. Flink supports a controlled cyclic dependency graph at runtime to represent machine learning algorithms in a very efficient way.

Hadoop and Storm do not provide any default interactive environment. Apache Spark has a command-line interactive shell to use the application features. Flink provides a Scala shell to configure standalone as well as cluster setups. Apache Hadoop is highly scalable, and it has been used in the Yahoo production environment consisting of 42000 nodes in 20 YARN clusters. The largest known cluster size for Spark is of 8000 computing nodes, while Storm has been tested on a maximum of 300 node clusters. An Apache Samza cluster with around a hundred nodes has been used in the LinkedIn data flow and application messaging system. Apache Flink has been customized for the Alibaba search engine with a deployment capacity of thousands of processing nodes.

3. Comparative Evaluation of Big Data Frameworks

Big data in cloud computing, a popular research trend, is posing significant influence on current enterprises, IT industries, and research communities. There are a number of disruptive and transformative big data technologies and solutions that are rapidly emanating and evolving in order to provide data-driven insight and innovation. Furthermore, modern cloud computing services are offering all kinds of big data analytic tools, technologies, and computing infrastructures to speed up the data analysis process at an affordable cost. Although many distributed resource management frameworks are available nowadays, the main issue is how to


select a suitable big data framework. The selection of one big data platform over the others will come down to the specific application requirements and constraints that may involve several tradeoffs and application usage scenarios. However, we can identify some key factors that need to be fulfilled before deploying a big data application in the cloud. In this section, based on some empirical evidence from the available literature, we discuss the advantages and disadvantages of each resource management framework.

3.1. Processing Speed. Processing speed is an important performance measurement that may be used to evaluate the effectiveness of different resource management frameworks. It is a common metric for the maximum number of I/O operations to disk or memory, or the data transfer rate between the computational units of the cluster, over a specific amount of time. In the context of big data, the average processing speed, represented as $\bar{m}$ and calculated after $n$ iteration runs, is the maximum amount of memory/disk intensive operations that can be performed over a time interval $t_i$:

$$\bar{m} = \frac{\sum_{i=1}^{n} m_i}{\sum_{i=1}^{n} t_i}. \quad (1)$$
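As a minimal illustration of equation (1), with made-up per-iteration operation counts and timings as they might be collected from a monitoring tool:

```python
def average_processing_speed(ops_per_iter, time_per_iter):
    # Average processing speed m = (sum of m_i) / (sum of t_i), eq. (1).
    if len(ops_per_iter) != len(time_per_iter):
        raise ValueError("each iteration needs both an op count and a duration")
    return sum(ops_per_iter) / sum(time_per_iter)

# Three benchmark iterations: operations performed and seconds elapsed.
m = average_processing_speed([1200, 1500, 1300], [2.0, 2.5, 2.0])
# m = 4000 / 6.5, roughly 615 operations per second
```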

Veiga et al. [30] conducted a series of experiments on a multicore cluster setup to demonstrate performance results of Apache Hadoop, Spark, and Flink. Apache Spark and Flink resulted to be much more efficient execution platforms than Hadoop when performing nonsort benchmarks. It was further noted that Spark showed better performance results for operations such as WordCount and K-Means (CPU-bound in nature), while Flink achieved better results in the PageRank algorithm (memory bound in nature). Mavridis and Karatza [31] experimentally compared performance statistics of Apache Hadoop and Spark on the Okeanos IaaS cloud platform. For each set of experiments, necessary statistics related to execution time, working nodes, and the dataset size were recorded. Spark performance was found optimal as compared to Hadoop for most of the cases. Furthermore, Spark on the YARN platform showed suboptimal results as compared to the case when it was executed in standalone mode. Some similar results were also observed by Zaharia et al. [32] on a 100 GB dataset record. Vellaipandiyan and Raja [33] demonstrated performance evaluation and comparison of Hadoop and Spark frameworks on residents' record datasets ranging from 100 GB to 900 GB in size. Spark's scale of performance was relatively better when the dataset size was between small and medium size (100 GB–750 GB); afterwards, its performance declined as compared to Hadoop. The primary reason for the performance decline was that the Spark cache size could not fit into memory for the larger dataset. Taran et al. [34] quantified performance differences of Hadoop and Spark using a WordCount dataset ranging from 100 KB to 1 GB. It was observed that the Hadoop framework was five times faster than Spark when the evaluation was performed using a larger set of data sources. However, for the smaller tasks, Spark showed better performance results; the speed-up ratio decreased for both frameworks with the growth of the input dataset.

Gopalani and Arora [35] used the K-Means algorithm on some small, medium, and large location datasets to compare Hadoop and Spark frameworks. The study results showed that Spark performed up to three times better than MapReduce for most of the cases. Bertoni et al. [36] performed an experimental evaluation of Apache Flink and Storm using a large genomic dataset on the Amazon EC2 cloud. Apache Flink was superior to Storm for histogram and map operations, while Storm outperformed Flink when the genomic join application was deployed.

3.2. Fault Tolerance. Fault tolerance is the characteristic that enables a system to continue functioning in case of the failure of one or more components. High-performance computing applications involve hundreds of nodes that are interconnected to perform a specific task; failing a node should have zero or minimal effect on the overall computation. The tolerance of a system, represented as Tol_FT, to meet its requirements after a disruption is the ratio of the time to complete tasks without observing any fault events to the overall execution time where some fault events were detected and the system state is reverted back to a consistent state:

$$\mathrm{Tol}_{\mathrm{FT}} = \frac{T_x}{T_x + \sigma^2}, \quad (2)$$

where $T_x$ is the estimated correct execution time, obtained from a program run that is presumed to be fault-free or by averaging the execution time from several application runs that produce a known correct output, and $\sigma^2$ represents the variance in a program's execution time due to the occurrence of fault events. For an HPC application that consists of a set of computationally intensive tasks $\Gamma = \{\tau_1, \tau_2, \ldots, \tau_n\}$, and since Tol_FT for each individual task is computed as $\mathrm{Tol}_{\tau_i}$, the overall application resilience Tol may be calculated as [37]

$$\mathrm{Tol} = \sqrt[n]{\mathrm{Tol}_{\tau_1} \cdot \mathrm{Tol}_{\tau_2} \cdots \mathrm{Tol}_{\tau_n}}. \quad (3)$$
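Equations (2) and (3) translate directly into code; the runtimes and variances below are made-up numbers for illustration:

```python
import math

def tol_ft(fault_free_time, variance):
    # Per-task tolerance Tol_FT = T_x / (T_x + sigma^2), eq. (2).
    return fault_free_time / (fault_free_time + variance)

def application_resilience(task_tolerances):
    # Overall resilience as the geometric mean of task tolerances, eq. (3).
    n = len(task_tolerances)
    return math.prod(task_tolerances) ** (1.0 / n)

# Three tasks with estimated fault-free runtimes (s) and runtime variances.
tolerances = [tol_ft(t, v) for t, v in [(100, 25), (80, 10), (120, 40)]]
overall = application_resilience(tolerances)  # a value between 0 and 1
```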

Lu et al. [38] used the StreamBench toolkit to evaluate the performance and fault tolerance ability of Apache Spark, Storm, and Samza. It was found that, with increased data size, Spark is much more stable and fault-tolerant as compared to Storm, but may be less efficient when compared to Samza. Furthermore, when compared in terms of handling a large capacity of data, both Samza and Spark outperformed Storm. Gu and Li [39] used the PageRank algorithm to perform a comparative experiment on Hadoop and Spark frameworks. It was observed that, for smaller datasets such as wiki-Vote and soc-Slashdot0902, Spark outperformed Hadoop by a much better margin. However, this speed-up result degraded with the growth of the dataset, and for large datasets Hadoop easily outperformed Spark. Furthermore, for massively large datasets, Spark was reported to crash with a JVM heap exception while Hadoop still performed its task. Lopez et al. [40] evaluated the throughput and fault tolerance mechanisms of Apache Storm and Flink. The experiments were based on a threat detection system, where Apache Storm demonstrated better throughput as compared to Flink. For fault tolerance,


different virtual machines were manually turned off to analyze the impact of node failures. Apache Flink used its internal subsystem to detect and migrate the failed tasks to other machines and hence resulted in very few message losses. On the other hand, Storm took more time, as Zookeeper (involving some performance overhead) was responsible for reporting the state of Nimbus and thereafter processing the failed task on other nodes.

3.3. Scalability. Scalability refers to the ability to accommodate large loads or changes in size/workload by provisioning resources at runtime. This can be further categorized as scale-up (making the hardware stronger) or scale-out (adding additional nodes). One of the critical requirements of enterprises is to process large volumes of data in a timely manner to address high-value business problems. Dynamic resource scalability allows business entities to perform massive computation in parallel, thus reducing overall time complexity and effort. The definition of scalability comes from Amdahl's and Gustafson's laws [41]. Let $W$ be the size of the workload before the improvement of the system resources; the fraction of the execution workload that does not benefit from the improvement of system resources is $\alpha$, and the fraction that does benefit is $1 - \alpha$. When using an $n$-processor system, the user workload is scaled to

$$W' = \alpha W + (1 - \alpha) n W. \quad (4)$$

The parallel execution time of a scaled workload on $n$ processors is defined as the scaled-workload speed-up $S$, as shown in

$$S = \frac{W'}{W} = \frac{\alpha W + (1 - \alpha) n W}{W} = \alpha + (1 - \alpha) n. \quad (5)$$
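Since $W$ cancels out in equation (5), the scaled speed-up depends only on $\alpha$ (the fraction that does not scale with added processors) and $n$; a short sketch with illustrative values:

```python
def scaled_speedup(alpha, n):
    # Gustafson's scaled speed-up S = alpha + (1 - alpha) * n, eq. (5),
    # where alpha is the fraction of work that does not benefit from
    # additional processors.
    return alpha + (1.0 - alpha) * n

# With 20% of the workload unable to benefit from extra processors,
# doubling from 8 to 16 nodes nearly doubles the scaled speed-up.
s8 = scaled_speedup(0.2, 8)    # 0.2 + 0.8 * 8  = 6.6
s16 = scaled_speedup(0.2, 16)  # 0.2 + 0.8 * 16 = 13.0
```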

García-Gil et al. [42] performed a scalability and performance comparison of Apache Spark and Flink using a feature selection framework to assemble multiple information theoretic criteria into a single greedy algorithm. The ECBDL14 dataset was used to measure the scalability factor for the frameworks. It was observed that Spark's scalability performance was 4–10 times faster than Flink. Jakovits and Srirama [43] analyzed four MapReduce based frameworks, including Hadoop and Spark, for benchmarking Partitioning Around Medoids, Clustering Large Applications, and Conjugate Gradient linear system solver algorithms using MPI. All experiments were performed on a cloud of 33 Amazon EC2 large instances. For all algorithms, Spark performed much better as compared to Hadoop in terms of both performance and scalability. Boden et al. [44] aimed to investigate the scalability with respect to both data size and dimensionality in order to demonstrate a fair and insightful benchmark that reflects the requirements of real-world machine learning applications. The benchmark comprised distributed optimization algorithms for supervised learning as well as algorithms for unsupervised learning. For supervised learning, they implemented machine learning algorithms using the Breeze library, while for unsupervised learning they chose the popular k-Means clustering algorithm in order to assess the scalability of the two resource management frameworks. The overall execution time of Flink was relatively low in resource-constrained settings with a limited number of nodes, while Spark had a clear edge once enough main memory was available due to the addition of new computing nodes.

3.4. Machine Learning and Iterative Tasks Support. Big data applications are inherently complex in nature and usually involve tasks and algorithms that are iterative in nature. These applications have a distinct cyclic nature to achieve the desired result by continually repeating a set of tasks until these cannot be substantially reduced further.

Spangenberg et al. [45] used real-world datasets, consisting of four algorithms, that is, WordCount, K-Means, PageRank, and relational query, to benchmark Apache Flink and Storm. It was observed that Apache Storm performs better in batch mode as compared to Flink. However, with increasing complexity, Apache Flink had a performance advantage over Storm, and thus it was better suited for iterative data or graph processing. Shi et al. [46] focused on analyzing Apache MapReduce and Spark for batch and iterative jobs. For smaller datasets, Spark resulted to be a better choice, but when experiments were performed on larger datasets, MapReduce turned out to be several times faster than Spark. For iterative operations such as K-Means, Spark turned out to be 1.5 times faster as compared to MapReduce in its first iteration, while Spark was more than 5 times faster in subsequent operations.

Kang and Lee [47] examined five resource management frameworks, including Apache Hadoop and Spark, with respect to performance overheads (disk input/output, network communication, scheduling, etc.) in supporting iterative computation. The PageRank algorithm was used to evaluate these performance issues. Since static data processing tends to be a more frequent operation than dynamic data, as it is used in every iteration of MapReduce, it may cause significant performance overhead in the case of MapReduce. On the other hand, Apache Spark uses a read-only cached version of objects (resilient distributed datasets) which can be reused in parallel operations, thus reducing the performance overhead during iterative computation. Lee et al. [48] evaluated five systems, including Hadoop and Spark, over various workloads to compare against four iterative algorithms. The experimentation was performed on the Amazon EC2 cloud. Overall, Spark showed the best performance when iterative operations were performed in main memory. In contrast, the performance of Hadoop was significantly poor as compared to the other resource management systems.
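The benefit of caching static data across iterations, as Spark's resilient distributed datasets do, can be mimicked with a toy Python sketch. This is a deliberately simplified stand-in, not the Spark API: without caching, the static input is re-derived on every iteration (the MapReduce pattern); with caching, it is materialized once.

```python
def load_static_data(counter):
    # Stand-in for an expensive read of static input (e.g., an edge list).
    counter["loads"] += 1
    return [("a", ["b", "c"]), ("b", ["c"]), ("c", ["a"])]

def pagerank_like(iterations, cache):
    counter = {"loads": 0}
    cached = load_static_data(counter) if cache else None
    ranks = {"a": 1.0, "b": 1.0, "c": 1.0}
    for _ in range(iterations):
        # Cached (RDD-like) input is reused; otherwise it is reloaded.
        links = cached if cache else load_static_data(counter)
        contribs = {}
        for node, outs in links:
            for out in outs:
                contribs[out] = contribs.get(out, 0.0) + ranks[node] / len(outs)
        ranks = {n: 0.15 + 0.85 * contribs.get(n, 0.0) for n in ranks}
    return counter["loads"]

# Recomputation touches the input once per iteration, while the cached
# dataset is loaded exactly once, regardless of the iteration count.
loads_without_cache = pagerank_like(10, cache=False)  # 10 loads
loads_with_cache = pagerank_like(10, cache=True)      # 1 load
```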

3.5. Latency. Big data and low latency are strongly linked. Big data applications provide true value to businesses, but these are mostly time critical. If cloud computing has to be the successful platform for big data implementation, one of the key requirements will be the provisioning of high-speed networks to reduce communication latency. Furthermore, big data frameworks usually involve a centralized design where the scheduler assigns all tasks through a single node, which may significantly impact the latency when the size of data is huge.


Let $T_{\text{elapsed}}$ be the elapsed time between the start and finish time of a program in a distributed architecture, $T_i$ be the effective execution time, and $\lambda_i$ be the sum of total idle units of the $i$th processor from a set of $N$ processors. Then the average latency, represented as $\lambda(W, N)$ for the size of workload $W$, is defined as the average amount of overhead time needed for each processor to complete the task:

$$\lambda(W, N) = \frac{\sum_{i=1}^{N} \left(T_{\text{elapsed}} - T_i + \lambda_i\right)}{N}. \quad (6)$$
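Equation (6) can be computed from per-processor measurements; the timings below are illustrative placeholders:

```python
def average_latency(elapsed, effective_times, idle_units):
    # Average per-processor overhead lambda(W, N), eq. (6): for each
    # processor, the gap between overall elapsed time and its effective
    # compute time, plus its accumulated idle time, averaged over N.
    n = len(effective_times)
    return sum(elapsed - t + idle
               for t, idle in zip(effective_times, idle_units)) / n

# A 4-processor run that finished after 20 s overall; each processor
# reports its effective compute time and accumulated idle time.
lat = average_latency(20.0, [18.0, 16.0, 19.0, 15.0], [1.0, 2.5, 0.5, 3.0])
# lat = (3 + 6.5 + 1.5 + 8) / 4 = 4.75 seconds of overhead per processor
```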

Chintapalli et al. [49] conducted a detailed analysis of the Apache Storm, Flink, and Spark streaming engines for latency and throughput. The study results indicated that, for high throughput, Flink and Storm have significantly lower latency as compared to Spark. However, Spark was able to handle higher throughput as compared to the other streaming engines. Lu et al. [38] proposed the StreamBench benchmark framework to evaluate modern distributed stream processing frameworks. The framework includes dataset selection, data generation methodologies, program set description, workload suites design, and metric proposition. Two real-world datasets, AOL Search Data and the CAIDA Anonymized Internet Traces Dataset, were used to assess the performance, scalability, and fault tolerance aspects of streaming frameworks. It was observed that Storm's latency in most cases was far less than Spark's, except in the case when the scales of the workload and dataset were massive in nature.

3.6. Security. Instead of the classical HPC environment where information is stored in-house, many big data applications are now increasingly deployed on the cloud, where privacy-sensitive information may be accessed or recorded by different data users with ease. Although data privacy and security issues are not a new topic in the area of distributed computing, their importance is amplified by the wide adoption of cloud computing services for big data platforms. The dataset may be exposed to multiple users for different purposes, which may lead to security and privacy risks.

Let $N$ be a list of security categories that may be provided for a security mechanism. For instance, a framework may use an encryption mechanism to provide data security and access control lists for authentication and authorization services. Let $\max(W_i)$ be the maximum weight that is assigned to the $i$th security category from a list of $N$ categories, and $W_i$ be the reputation score of a particular resource management framework. Then the framework security ranking score can be represented as

$$\mathrm{Sec}_{\text{score}} = \frac{\sum_{i=1}^{N} W_i}{\sum_{i=1}^{N} \max(W_i)}. \quad (7)$$
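A sketch of equation (7) with hypothetical security categories and weights (the category names and numbers are our own illustrative choices, not values from the surveyed frameworks):

```python
def security_score(category_scores, category_weights):
    # Framework security ranking, eq. (7): earned reputation scores
    # normalized by the maximum weight of each security category.
    return sum(category_scores) / sum(category_weights)

# Hypothetical categories: authentication, authorization, encryption,
# each weighted out of a maximum of 1.0.
score = security_score([0.9, 0.7, 0.5], [1.0, 1.0, 1.0])  # 2.1 / 3.0 = 0.7
```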

Hadoop and Storm use the Kerberos authentication protocol for computing nodes to prove their identity [50]. Spark adopts a password based shared secret configuration as well as Access Control Lists (ACLs) to control the authentication and authorization mechanisms. In Flink, stream brokers are responsible for providing the authentication mechanism across multiple services. Apache Samza/Kafka provides no built-in security at the system level.

3.7. Dataset Size Support. Many scientific applications scale up to hundreds of nodes to process massive amounts of data that may exceed hundreds of terabytes. Unless big data applications are properly optimized for larger datasets, this may result in performance degradation with the growth of the data. Furthermore, in many cases, this may result in a crash of software, resulting in loss of time and money. We used the same methodology to collect big data support statistics as presented in earlier sections.

4. Discussion on Big Data Frameworks

Every big data framework has evolved for its unique design characteristics and application-specific requirements. Based on the key factors and their empirical evidence to evaluate big data frameworks, as discussed in Section 3, we produce the summary of results of seven evaluation factors in the form of a star ranking, $\mathrm{Ranking}_{\mathrm{RF}} \in [0, s]$, where $s$ represents the maximum evaluation score. The general equation to produce the ranking for each resource framework (RF) is given as

$$\mathrm{Ranking}_{\mathrm{RF}} = \left\| \frac{s}{\sum_{i=1}^{7} \sum_{j=1}^{N} \max(W_{ij})} \cdot \sum_{i=1}^{7} \sum_{j=1}^{N} W_{ij} \right\|, \quad (8)$$

where $N$ is the total number of research studies for a particular set of evaluation metrics, $\max(W_{ij}) \in [0, 1]$ is the relative normalized weight assigned to each literature study based on the number of experiments performed, and $W_{ij}$ is the framework test-bed score calculated from the experimentation results of each study.
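Equation (8) can be sketched as follows. We interpret the enclosing bars as rounding to the nearest whole star (our assumption), and the toy example uses two factors and two studies instead of the paper's seven factors, purely for illustration:

```python
def framework_ranking(weights, max_weights, s=5):
    # Star ranking of a resource framework, eq. (8): test-bed scores W_ij
    # summed over evaluation factors i and studies j, normalized by the
    # maximum attainable weights, scaled to [0, s], and rounded to stars.
    total = sum(sum(per_factor) for per_factor in weights)
    max_total = sum(sum(per_factor) for per_factor in max_weights)
    return round(s * total / max_total)

# Two evaluation factors, each backed by two studies (made-up numbers):
observed = [[0.8, 0.6], [0.9, 0.7]]
possible = [[1.0, 1.0], [1.0, 1.0]]
stars = framework_ranking(observed, possible)  # round(5 * 3.0 / 4.0) = 4
```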

Hadoop MapReduce has a clear edge on large-scale deployment and larger dataset processing. Hadoop is highly compatible and interoperable with other frameworks. It also offers a reliable fault tolerance mechanism to provide failure-free operation over a long period of time. Hadoop can operate on a low-cost configuration. However, Hadoop is not suitable for real-time applications. It has a significant disadvantage when latency, throughput, and iterative job support for machine learning are the key considerations of application requirements.

Apache Spark is designed to be a replacement for the batch-oriented Hadoop ecosystem to run over static and real-time datasets. It is highly suitable for high throughput streaming applications where latency is not a major issue. Spark is memory intensive and all operations take place in memory. As a result, it may crash if enough memory is not available for further operations (before the release of Spark version 1.5, it was not capable of handling datasets larger than the size of RAM, and the problem of handling larger datasets still persists in the newer releases with different performance overheads). A few research efforts, such as Project Tungsten, are aimed at addressing the efficiency of memory and CPU for Spark applications. Spark also lacks its own storage system, so its integration with HDFS through YARN, or with Cassandra using Mesos, is an extra overhead for cluster configuration.

Apache Flink is a true streaming engine. Flink supports both batch and real-time operations over a common runtime to fulfill the requirements of the Lambda architecture. However,


it may also work in batch mode by stopping the streaming source. Like Spark, Flink performs all operations in memory, but in case of a memory hog it may also use disk storage to avoid application failure. Flink has some major advantages over Hadoop and Spark by providing better support for iterative processing with high throughput at the cost of low latency.

Apache Storm was designed to provide a scalable, fault tolerant, real-time streaming engine for data analysis, which Hadoop did for batch processing. However, the empirical evidence suggests that Apache Storm proved to be inefficient to meet the scale-up/scale-down requirements for real-time big data applications. Furthermore, since it uses microbatch stream processing, it is not very efficient where continuous stream processing is a major concern, nor does it provide a mechanism for simple batch processing. For fault tolerance, Storm uses Zookeeper to store the state of the processes, which may involve some extra overhead and may also result in message loss. On the other hand, Storm is an ideal solution for near-real-time application processing where the workload can be processed with a minimal delay under strict latency requirements.

Apache Samza, in integration with Kafka, provides some unique features that are not offered by other stream processing engines. Samza provides a powerful check-pointing based fault tolerance mechanism with minimal data loss. Samza jobs can have high throughput with low latency when integrated with Kafka. However, Samza lacks some important features as a data processing engine. Furthermore, it offers no built-in security mechanism for data access control.

To categorize the selection of the best resource engine based on a particular set of requirements, we use the framework proposed by Chung et al. [51]. The framework provides a layout for matching, ranking, and selecting a system based on some particular requirements. The matching criterion is based on user goals, which are categorized as soft and hard goals. Soft goals represent the nonfunctional requirements of the system (such as security, fault tolerance, and scalability), while hard goals represent the functional aspects of the system (such as machine learning and data size support). The relationship between the system components and the goals can be ranked as very positive (++), positive (+), negative (−), and very negative (−−). Based on the evidence provided from the literature, the categorization of major resource management frameworks is presented in Figure 7.

5. Related Work

Our research work differs from other efforts because the subject, goal, and object of study are not identical, as we provide an in-depth comparison of popular resource engines based on empirical evidence from the existing literature. Hesse and Lorenz [8] conducted a conceptual survey on stream processing systems. However, their discussion was focused on some basic differences related to real-time data processing engines. Singh and Reddy [9] provided a thorough analysis of big data analytic platforms that included peer-to-peer networks, field programmable gate arrays (FPGA), the Apache Hadoop ecosystem, high-performance computing (HPC)

clusters, multicore CPUs, and graphics processing units (GPUs). Our case is different here, as we are particularly interested in big data processing engines. Finally, Landset et al. [10] focused on machine learning libraries and their evaluation based on ease of use, scalability, and extensibility. However, our work differs from their work, as the primary focuses of both studies are not identical.

Chen and Zhang [11] discussed big data problems and challenges and the associated techniques and technologies to address these issues. Several potential techniques, including cloud computing, quantum computing, granular computing, and biological computing, were investigated, and the possible opportunities to explore these domains were demonstrated. However, the performance evaluation was discussed only on theoretical grounds. A taxonomy and detailed analysis of the state of the art in big data 2.0 processing systems were presented in [12]. The focus of the study was to identify current research challenges and highlight opportunities for new innovations and optimization for future research and development. Assuncao et al. [13] reviewed multiple generations of data stream processing frameworks that provide mechanisms for resource elasticity to match the demands of stream processing services. The study examined the challenges associated with efficient resource management decisions and suggested solutions derived from the existing research studies. However, the study metrics are restricted to the elasticity/scalability aspect of big data streaming frameworks.

As shown in Table 3, our work differs from the previous studies, which focused on the classification of resource management frameworks on theoretical grounds. In contrast to the earlier approaches, we classify and categorize big data resource management frameworks on empirical grounds, derived from multiple evaluation/experimentation studies. Furthermore, our evaluation/ranking methodology is based on a comprehensive list of study variables which were not addressed in the studies conducted earlier.

6. Conclusions and Future Work

There are a number of disruptive and transformative big data technologies and solutions that are rapidly emanating and evolving in order to provide data-driven insight and innovation. The primary object of the study was how to classify popular big data resource management systems. This study was also aimed at addressing the selection of a candidate resource provider based on specific big data application requirements. We surveyed different big data resource management frameworks and investigated the advantages and disadvantages of each of them. We carried out the performance evaluation of resource management engines based on seven key factors, and each one of the frameworks was ranked based on the empirical evidence from the literature.

6.1. Observations and Findings. Some key findings of the study are as follows:

(i) In terms of processing speed, Apache Flink outperforms other resource management frameworks for small, medium, and large datasets [30, 36]. However, during our own set of experiments on an Amazon EC2


Table 3: Comparison and application areas of related research studies.

Study reference | Data model | Resource frameworks | Study features | Evaluation/ranking methodology
[8] | Data stream processing systems | Storm, Flink, Spark, Samza | A brief comparison of resource frameworks | N
[9] | Batch and stream processing systems | Horizontal scaling systems such as peer-to-peer, MapReduce, MPI, and Spark, and vertical scaling systems such as CUDA and HDL | Comparison of horizontal and vertical scaling systems | Theoretical comparison of resource frameworks
[10] | Batch and stream processing engines | MapReduce, Spark, Flink, and Storm, as well as machine learning libraries | Machine learning libraries and their evaluation mechanism | Performance comparison with respect to machine learning toolkits
[11] | Batch and stream processing frameworks | Hadoop, Storm, and other big data frameworks | In-depth analysis of big data opportunities and challenges | N
[12] | Batch and stream processing frameworks | Hadoop, Spark, Storm, Flink, and Tez, as well as SQL, Graph, and bulk synchronous parallel models | Analysis of current open research challenges in the field of big data and the promising directions for future research | N
[13] | Stream processing engines | Apache Storm, S4, Flink, Samza, Spark Streaming, and Twitter Heron | Classification of elasticity metrics for resource allocation strategies that meet the demands of stream processing services | Evaluation of elasticity scaling metrics for stream processing systems


[Figure 7: Categorization of resource engines based on key big data application requirements — soft goals (fault tolerance, scalability, performance, security, latency) and hard goals (machine learning, data size support) are decomposed into subgoals such as iterative support, recoverability, scalability, dataset size, access control, and authentication, with Hadoop, Spark, Flink, Storm, and Samza rated as very positive (++) or positive (+) against them.]

cluster with varied task manager settings (1–4 task managers per node), Flink failed to complete custom smaller size JVM dataset jobs due to inefficient memory management of the Flink memory manager. We could not find any reported evidence of this particular use case in the relevant literature studies. It seems that most of the performance evaluation studies employed standard benchmarking test sets where the dataset size was relatively large, and hence this particular use case was not reported in these studies. Further research effort is required to elucidate the underlying factors in this particular case.

(ii) Big data applications usually involve massive amounts of data. Apache Spark supports the necessary strategies for its fault tolerance mechanism, but it has been reported to crash on larger datasets. Even during

our experimentations, Apache Spark version 1.6 (selected due to compatibility reasons with earlier researches) crashed on several occasions when the dataset was larger than 500 GB. Although Spark has been ranked higher in terms of fault tolerance with the increase of data scale in several studies [38, 52], it has limitations in handling larger typical big data dataset applications, and hence such studies cannot be generalized.

(iii) Spark MLlib and Flink-ML offer a variety of machine learning algorithms and utilities to exploit distributed and scalable big data applications. Spark MLlib outperforms Flink-ML in most of the machine learning use cases [42], except in the case when repeated passes are performed on unchanged data. However, this performance evaluation may further be investigated,


as research studies such as [53] reported differently, where Flink outperformed Spark on sufficiently large cases.

(iv) For graph processing algorithms such as PageRank, Flink uses the Gelly library, which provides native closed-loop iteration operators, making it a suitable platform for large-scale graph analytics. Spark, on the other hand, uses the GraphX library, which has a much longer preprocessing time to build graphs and other data structures, making its performance worse as compared to Apache Flink. Apache Flink has been reported to obtain the best results for graph processing datasets (3x–5x in [54] and 2x–3x in [45]) as compared to Spark. However, some studies, such as [55], reported Spark to be 1.7x faster than Flink for large graph processing. Such inconsistent behavior may be further investigated in future research studies.

6.2. Future Work. In the area of big data, there is still a clear gap that requires more effort from the research community to build an in-depth understanding of the performance characteristics of big data resource management frameworks. We consider this study a step towards enlarging our knowledge of the big data world and an effort towards improving the state of the art and achieving the big vision in the big data domain. In earlier studies, a clear ranking could not be established, as the study parameters were mostly limited to a few issues such as throughput, latency, and machine learning. Further investigation is also required on resource engines such as Apache Samza in comparison with other frameworks. In addition, research effort needs to be carried out in several areas, such as data organization, platform-specific tools, and technological issues in the big data domain, in order to create next-generation big data infrastructures. The performance evaluation factors might also vary among systems depending on the algorithms used. As future work, we plan to benchmark these popular resource engines for meeting the resource demands and requirements of different scientific applications. Moreover, a scalability analysis could be done; in particular, the performance evaluation of dynamically adding worker nodes and the resulting performance analysis is of particular interest. Additionally, further research can be carried out to evaluate performance with respect to resource competition between jobs (on different resource schedulers such as YARN and Mesos) and the fluctuation of available computing resources. Finally, most of the experiments in earlier studies were performed using standard parameter configurations; however, each resource management framework offers domain-specific tweaks and configuration optimization mechanisms for meeting application-specific requirements. The development of a benchmark suite that aims to find maximum throughput based on configuration optimization would be an interesting direction for future research.
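One minimal shape such a benchmark suite could take is a brute-force sweep over candidate configurations. The sketch below is plain Python; `run_benchmark` is a hypothetical stub standing in for deploying a job with a given configuration on a real cluster and measuring its throughput:

```python
from itertools import product

def run_benchmark(config):
    # Hypothetical stand-in for a real cluster run that returns
    # measured throughput (records/second) for this configuration.
    return 1000 / (1 + abs(config["executors"] - 8)) * config["memory_gb"]

# Candidate settings to sweep; real frameworks expose many more knobs.
grid = {"executors": [4, 8, 16], "memory_gb": [2, 4, 8]}

best = max(
    (dict(zip(grid, values)) for values in product(*grid.values())),
    key=run_benchmark,
)
print(best)  # configuration with the highest measured throughput
```

A production suite would replace exhaustive sweeping with search strategies (random or Bayesian), since the configuration space of real engines is far too large to enumerate.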

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] "Big Data in the Cloud: Converging Technologies (August 11)," https://www.intel.com/content/dam/www/public/emea/de/de/documents/product-briefs/big-data-cloud-technologies-brief.pdf.

[2] Q. He, J. Yan, R. Kowalczyk, H. Jin, and Y. Yang, "Lifetime service level agreement management with autonomous agents for services provision," Information Sciences, vol. 179, no. 15, pp. 2591–2605, 2009.

[3] O. Rana, M. Warnier, T. B. Quillinan, and F. Brazier, "Monitoring and reputation mechanisms for service level agreements," in 5th International Workshop on Grid Economics and Business Models, GECON '08, vol. 5206 of Lecture Notes in Computer Science, pp. 125–139, 2008.

[4] N. Yuhanna and M. Gualtieri, "The Forrester Wave: Big Data Hadoop Cloud Solutions, Q2 2016: Elasticity, Automation, and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud (June 20)," https://ncmedia.azureedge.net/ncmedia/2016/05/The_Forrester_Wave_Big_D.pdf.

[5] R. Arora, "An introduction to big data, high performance computing, high-throughput computing, and Hadoop," in Conquering Big Data with High Performance Computing, pp. 1–12, 2016.

[6] F. Pop, J. Kołodziej, and B. D. Martino, Resource Management for Big Data Platforms: Algorithms, Modelling, and High-Performance Computing Techniques, Springer, 2016.

[7] M. Gribaudo, M. Iacono, and F. Palmieri, "Performance modeling of big data-oriented architectures," in Resource Management for Big Data Platforms, Computer Communications and Networks, pp. 3–34, Springer International Publishing, Cham, 2016.

[8] G. Hesse and M. Lorenz, "Conceptual survey on data stream processing systems," in Proceedings of the 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), pp. 797–802, Melbourne, Australia, December 2015.

[9] D. Singh and C. K. Reddy, "A survey on platforms for big data analytics," Journal of Big Data, vol. 2, Article ID 8, 2014.

[10] S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin, "A survey of open source tools for machine learning with big data in the Hadoop ecosystem," Journal of Big Data, vol. 2, no. 1, Article ID 24, 2015.

[11] C. L. P. Chen and C. Y. Zhang, "Data-intensive applications, challenges, techniques and technologies: a survey on Big Data," Information Sciences, vol. 275, pp. 314–347, 2014.

[12] F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, and S. Sakr, "Big data 2.0 processing systems: taxonomy and open challenges," Journal of Grid Computing, vol. 14, no. 3, pp. 379–405, 2016.

[13] M. D. de Assunção, A. da S. Veith, and R. Buyya, "Distributed data stream processing and edge computing: a survey on resource elasticity and future directions," Journal of Network and Computer Applications, vol. 103, pp. 1–17, 2018.

[14] "Big data working group: big data taxonomy, 2014."

[15] W. Inoubli, S. Aridhi, H. Mezni, M. Maddouri, and E. M. Nguifo, "An experimental survey on big data frameworks," Future Generation Computer Systems, 2018.

[16] H.-K. Kwon and K.-K. Seo, "A fuzzy AHP based multi-criteria decision-making model to select a cloud service," International Journal of Smart Home, vol. 8, no. 3, pp. 175–180, 2014.

[17] A. Rezgui and S. Rezgui, "A stochastic approach for virtual machine placement in volunteer cloud federations," in Proceedings of the 2nd IEEE International Conference on Cloud Engineering, IC2E 2014, pp. 277–282, Boston, MA, USA, March 2014.

[18] M. Salama, A. Zeid, A. Shawish, and X. Jiang, "A novel QoS-based framework for cloud computing service provider selection," International Journal of Cloud Applications and Computing, vol. 4, no. 2, pp. 48–72, 2014.

[19] G. Mateescu, W. Gentzsch, and C. J. Ribbens, "Hybrid computing - where HPC meets grid and cloud computing," Future Generation Computer Systems, vol. 27, no. 5, pp. 440–453, 2011.

[20] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: enabling scalable virtual organizations," International Journal of High Performance Computing Applications, vol. 15, no. 3, pp. 200–222, 2001.

[21] S. Benedict, "Performance issues and performance analysis tools for HPC cloud applications: a survey," Computing, vol. 95, no. 2, pp. 89–108, 2013.

[22] E. C. Inacio and M. A. R. Dantas, "A survey into performance and energy efficiency in HPC, cloud and big data environments," International Journal of Networking and Virtual Organisations, vol. 14, no. 4, pp. 299–318, 2014.

[23] N. Yuhanna and M. Gualtieri, "Elasticity, Automation and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud, Q2 2016."

[24] S. Ullah, M. D. Awan, and M. S. Khiyal, "A price-performance analysis of EC2, Google Compute and Rackspace cloud providers for scientific computing," Journal of Mathematics and Computer Science, vol. 16, no. 02, pp. 178–192, 2016.

[25] J. Hurwitz, A. Nugent, F. Halper, and M. Kaufman, Big Data for Dummies, John Wiley & Sons, 2013.

[26] S. K. Garg, S. Versteeg, and R. Buyya, "A framework for ranking of cloud computing services," Future Generation Computer Systems, vol. 29, no. 4, pp. 1012–1023, 2013.

[27] D. Wu, S. Sakr, and L. Zhu, "Big data programming models," in Handbook of Big Data Technologies, pp. 31–63, 2017.

[28] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, vol. 1, no. 1, 2016.

[29] A. Li, X. Yang, S. Kandula, and M. Zhang, "CloudCmp: comparing public cloud providers," in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10), pp. 1–14, ACM, Melbourne, Australia, November 2010.

[30] J. Veiga, R. R. Exposito, X. C. Pardo, G. L. Taboada, and J. Touriño, "Performance evaluation of big data frameworks for large-scale data analytics," in Proceedings of the 4th IEEE International Conference on Big Data, Big Data 2016, pp. 424–431, December 2016.

[31] I. Mavridis and H. Karatza, "Log file analysis in cloud with Apache Hadoop and Apache Spark," in Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015), Poland, 2015.

[32] M. Zaharia, M. Chowdhury, and T. Das, "Fast and interactive analytics over Hadoop data with Spark," USENIX Login, vol. 37, no. 4, pp. 45–51, 2012.

[33] S. Vellaipandiyan and P. V. Raja, "Performance evaluation of distributed framework over YARN cluster manager," in Proceedings of the 2016 IEEE International Conference on Computational Intelligence and Computing Research, ICCIC 2016, December 2016.

[34] V. Taran, O. Alienin, S. Stirenko, Y. Gordienko, and A. Rojbi, "Performance evaluation of distributed computing environments with Hadoop and Spark frameworks," in Proceedings of the 2017 IEEE International Young Scientists' Forum on Applied Physics and Engineering (YSF), pp. 80–83, Lviv, Ukraine, October 2017.

[35] S. Gopalani and R. Arora, "Comparing Apache Spark and Map Reduce with performance analysis using K-means," International Journal of Computer Applications, vol. 113, no. 1, pp. 8–11, 2015.

[36] M. Bertoni, S. Ceri, A. Kaitoua, and P. Pinoli, "Evaluating cloud frameworks on genomic applications," in Proceedings of the 3rd IEEE International Conference on Big Data, IEEE Big Data 2015, pp. 193–202, November 2015.

[37] S. Hukerikar, R. A. Ashraf, and C. Engelmann, "Towards new metrics for high-performance computing resilience," in Proceedings of the 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017, pp. 23–30, June 2017.

[38] R. Lu, G. Wu, B. Xie, and J. Hu, "Stream bench: towards benchmarking modern distributed stream computing frameworks," in Proceedings of the 7th IEEE/ACM International Conference on Utility and Cloud Computing, UCC 2014, pp. 69–78, December 2014.

[39] L. Gu and H. Li, "Memory or time: performance evaluation for iterative operation on Hadoop and Spark," in Proceedings of the 15th IEEE International Conference on High Performance Computing and Communications, HPCC 2013, and 11th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, EUC 2013, pp. 721–727, November 2013.

[40] M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte, "A performance comparison of open-source stream processing platforms," in Proceedings of the 59th IEEE Global Communications Conference, GLOBECOM 2016, December 2016.

[41] J. L. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, vol. 31, no. 5, pp. 532-533, 1988.

[42] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, "A comparison on scalability for batch big data processing on Apache Spark and Apache Flink," Big Data Analytics, vol. 2, no. 1, 2017.

[43] P. Jakovits and S. N. Srirama, "Evaluating MapReduce frameworks for iterative scientific computing applications," in Proceedings of the 2014 International Conference on High Performance Computing and Simulation, HPCS 2014, pp. 226–233, July 2014.

[44] C. Boden, A. Spina, T. Rabl, and V. Markl, "Benchmarking data flow systems for scalable machine learning," in Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2017, May 2017.

[45] N. Spangenberg, M. Roth, and B. Franczyk, Evaluating New Approaches of Big Data Analytics Frameworks, vol. 208 of Lecture Notes in Business Information Processing, Springer, Cham, 2015.

[46] J. Shi, Y. Qiu, U. F. Minhas et al., "Clash of the titans: MapReduce vs. Spark for large scale data analytics," in Proceedings of the 41st International Conference on Very Large Data Bases, pp. 2110–2121, Kohala Coast, HI, USA, 2015.

[47] M. Kang and J. Lee, "A comparative analysis of iterative MapReduce systems," in Proceedings of the Sixth International Conference, pp. 61–64, Jeju, Republic of Korea, October 2016.

[48] H. Lee, M. Kang, S.-B. Youn, J.-G. Lee, and Y. Kwon, "An experimental comparison of iterative MapReduce frameworks," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, pp. 2089–2094, October 2016.

[49] S. Chintapalli, D. Dagit, B. Evans et al., "Benchmarking streaming computation engines: Storm, Flink and Spark streaming," in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016, pp. 1789–1792, May 2016.

[50] X. Zhang, C. Liu, S. Nepal, W. Dou, and J. Chen, "Privacy-preserving layer over MapReduce on cloud," in Proceedings of the 2nd International Conference on Cloud and Green Computing, CGC 2012, Held Jointly with the 2nd International Conference on Social Computing and Its Applications, SCA 2012, pp. 304–310, China, November 2012.

[51] L. Chung, B. A. Nixon, and E. Yu, "Using nonfunctional requirements to systematically select among alternatives in architectural design," in Proceedings of the 1st International Workshop on Architectures for Software Systems, pp. 31–43, 1995.

[52] S. Qian, G. Wu, J. Huang, and T. Das, "Benchmarking modern distributed streaming platforms," in Proceedings of the 2016 IEEE International Conference on Industrial Technology (ICIT), pp. 592–598, Taipei, Taiwan, March 2016.

[53] F. Dambreville and A. Toumi, "Generic and massively concurrent computation of belief combination rules," in Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, pp. 1–6, Blagoevgrad, Bulgaria, November 2016.

[54] R. W. Techentin, M. W. Markland, R. J. Poole et al., "Page rank performance evaluation of cluster computing frameworks on Cray Urika-GX supercomputer," Computer Science and Engineering, vol. 6, pp. 33–38, 2016.

[55] O.-C. Marcu, A. Costan, G. Antoniu, and M. S. Perez-Hernandez, "Spark versus Flink: understanding performance in big data analytics frameworks," in Proceedings of the 2016 IEEE International Conference on Cluster Computing, CLUSTER 2016, pp. 433–442, Taipei, Taiwan, September 2016.



[Figure 4: Architectural components of Flink; source: [16]. A client program builds an optimized dataflow graph and submits it to the Job Manager (master / YARN application master), whose scheduler and checkpoint coordinator deploy, stop, and cancel tasks, trigger checkpoints, and collect task status, heartbeats, and statistics. Each Task Manager (worker) runs tasks in task slots and hosts a memory and I/O manager, a network manager, and an actor system for exchanging data streams and status updates.]

certain processing tasks, such as transformation, filtering, aggregating, and partitioning.

Storm defines a workflow as a directed acyclic graph (DAG), called a topology, with connected spouts and bolts as its vertices. Edges in the graph define the links between the bolts and the data stream. Unlike batch jobs, which are executed only once, Storm jobs run forever until they are killed. There are two types of nodes in a Storm cluster: nimbus (master node) and supervisor (worker node). Nimbus, similar to the Hadoop JobTracker, is the core component of Apache Storm; it is responsible for distributing load across the cluster, queuing and assigning tasks to different processing units, and monitoring execution status. Each worker node executes a process known as the supervisor, which may have one or more worker processes. The supervisor delegates tasks to worker processes; a worker process then creates a subset of the topology to run the task. Apache Storm relies on an internal distributed messaging system, called Netty, for the communication between nimbus and the supervisors. Zookeeper manages the communication between the real-time job tracker (nimbus) and the supervisors (Storm workers). Figure 5 outlines the high-level view of a Storm cluster.
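The spout/bolt pipeline can be miniaturized in plain Python as a word-count topology (illustrative names only; Storm's Java API wires the same DAG shape with TopologyBuilder, setSpout(), and setBolt()):

```python
# A word-count topology simulated in plain Python: a spout emits
# tuples, and two bolts (split, count) consume them along the edges
# of the DAG. This is a single-process sketch of the dataflow shape,
# not the Storm API itself.
from collections import Counter

def sentence_spout():
    # Source of the stream; in Storm, nextTuple() would emit these.
    yield from ["to be or not to be", "big data in the cloud"]

def split_bolt(sentences):
    for sentence in sentences:
        yield from sentence.split()       # one tuple per word

def count_bolt(words):
    return Counter(words)                 # stateful aggregation bolt

counts = count_bolt(split_bolt(sentence_spout()))
print(counts["to"], counts["be"])
```

In a real topology, each bolt would run as many parallel tasks across worker processes, with the stream grouped (e.g., by word) between stages.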

2.5. Apache Samza. Apache Samza [18] is a distributed stream processing framework written mainly in Scala and Java. Overall, it has a relatively high throughput as well as somewhat increased latency when compared with Storm [8]. It uses Apache Kafka, originally developed for LinkedIn, for messaging and streaming, while Apache Hadoop YARN or Mesos is utilized as the execution platform for overall resource management. Samza relies on Kafka's semantics to define the way streams are handled. Its main objective is to collect and deliver massively large volumes of event data, in particular log data, with low latency. A Kafka system's architecture is comparatively simple, as it consists only of a set of brokers, which are individual nodes that make up a Kafka cluster. Data streams are defined by topics: streams of related information that consumers can subscribe to. Topics are divided into partitions that are distributed over the broker instances, and the corresponding messages are retrieved using a pull mechanism. The basic flow of job execution is presented in Figure 6.
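The topic/partition pull model that Samza builds on can be sketched in a few lines of plain Python (illustrative classes, not the Kafka API):

```python
# Topic/partition pull model in miniature: messages are appended to
# partitions by key, and a consumer pulls from a partition at its own
# offset, much as Kafka consumers track offsets per partition.
class Topic:
    def __init__(self, partitions):
        self.partitions = [[] for _ in range(partitions)]

    def append(self, key, message):
        # Keyed partitioning keeps related messages in one ordered log.
        self.partitions[hash(key) % len(self.partitions)].append(message)

def pull(topic, partition, offset, max_records=10):
    log = topic.partitions[partition]
    batch = log[offset:offset + max_records]
    return batch, offset + len(batch)     # consumer advances its offset

events = Topic(partitions=2)
for i in range(5):
    events.append(key="host-1", message=f"log line {i}")

batch, next_offset = pull(events, partition=hash("host-1") % 2, offset=0)
print(len(batch), next_offset)  # 5 records pulled, offset advances to 5
```

Because the consumer owns the offset, a Samza task can replay a partition from any position after a failure, which is the basis of its recovery semantics.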

Tables 1 and 2 present a brief comparative analysis of these frameworks based on some common attributes. As shown in the tables, the MapReduce computation data flow follows a chain of stages with no loop. At each stage, the program


Table 1: Comparison of big data frameworks.

Attribute | Hadoop | Spark | Storm | Samza | Flink
Current stable version | 2.8.1 | 2.2.0 | 1.1.1 | 0.13.0 | 1.3.2
Batch processing | Yes | Yes | Yes | No | Yes
Computational model | MapReduce | Streaming (microbatches) | Streaming (microbatches) | Streaming | Supports continuous flow streaming, microbatch, and batch
Data flow | Chain of stages | Directed acyclic graph | Directed acyclic graphs (DAGs) with spouts and bolts | Streams (acyclic graph) | Controlled cyclic dependency graph through machine learning
Resource management | YARN | YARN, Mesos | HDFS (YARN), Mesos | YARN, Mesos | Zookeeper, YARN, Mesos
Language support | All major languages | Java, Scala, Python, and R | Any programming language | JVM languages | Java, Scala, Python, and R
Job management optimization | MapReduce approach | Catalyst extension | Storm-YARN, 3rd-party tools like Ganglia | Internal JobRunner | Internal optimizer
Interactive mode | None (3rd-party tools like Impala can be integrated) | Interactive shell | None | Limited API of Kafka streams | Scala shell
Machine learning libraries | Apache Mahout, H2O | Spark ML and MLlib | Trident-ML, Apache SAMOA | Apache SAMOA | Flink-ML
Maximum reported nodes (scalability) | Yahoo Hadoop cluster with 42,000 nodes | 8,000 | 300 | LinkedIn with around a hundred node clusters | Alibaba customized Flink cluster with thousands of nodes


Table 2: Comparative analysis of big data resource frameworks (ratings out of s = 5).

Criterion | Hadoop | Spark | Flink | Storm | Samza
Processing speed | 3 | 4 | 5 | 4 | 4
Fault tolerance | 4 | 2 | 4 | 3 | 4
Scalability | 5 | 4 | 3 | 3 | 3
Machine learning | 2 | 5 | 4 | 3 | 4
Low latency | 2 | 3 | 3 | 4 | 4
Security | 4 | 5 | 4 | 4 | N
Dataset size | 5 | 3 | 4 | 3 | 3

[Figure 5: Architecture of a Storm cluster; source: [17]. A nimbus master node coordinates, through the Zookeeper framework, the worker nodes; each worker node runs a supervisor that manages worker processes, whose executors run the individual tasks.]

[Figure 6: Samza architecture; source: [18]. A Samza job is split into tasks running inside Samza containers on YARN NodeManager machines, consuming from and producing to Kafka topics.]

proceeds with the output from the previous stage and generates the input for the next stage. Although machine learning algorithms are mostly designed in the form of cyclic data flows, Spark, Storm, and Samza represent them as directed acyclic graphs to optimize the execution plan. Flink supports controlled cyclic dependency graphs at runtime to represent machine learning algorithms in a very efficient way.

Hadoop and Storm do not provide any default interactive environment. Apache Spark has a command-line interactive shell to use the application features, and Flink provides a Scala shell for configuring standalone as well as cluster setups. Apache Hadoop is highly scalable, and it has been used in Yahoo production clusters consisting of 42,000 nodes across 20 YARN clusters. The largest known cluster size for Spark is 8,000 computing nodes, while Storm has been tested on a maximum of 300-node clusters. An Apache Samza cluster with around a hundred nodes has been used in LinkedIn's data flow and application messaging system. Apache Flink has been customized for the Alibaba search engine with a deployment capacity of thousands of processing nodes.

3. Comparative Evaluation of Big Data Frameworks

Big data in cloud computing, a popular research trend, is exerting significant influence on current enterprises, IT industries, and research communities. There are a number of disruptive and transformative big data technologies and solutions that are rapidly emanating and evolving in order to provide data-driven insight and innovation. Furthermore, modern cloud computing services offer all kinds of big data analytic tools, technologies, and computing infrastructures to speed up the data analysis process at an affordable cost. Although many distributed resource management frameworks are available nowadays, the main issue is how to


select a suitable big data framework. The selection of one big data platform over the others comes down to the specific application requirements and constraints, which may involve several tradeoffs and application usage scenarios. However, we can identify some key factors that need to be fulfilled before deploying a big data application in the cloud. In this section, based on empirical evidence from the available literature, we discuss the advantages and disadvantages of each resource management framework.

3.1. Processing Speed. Processing speed is an important performance measurement that may be used to evaluate the effectiveness of different resource management frameworks. It is a common metric for the maximum number of I/O operations to disk or memory, or the data transfer rate between the computational units of the cluster, over a specific amount of time. In the context of big data, the average processing speed, represented as m and calculated after n iteration runs, is the maximum number of memory- or disk-intensive operations that can be performed over the time intervals t_i:

m = \frac{\sum_{i=1}^{n} m_i}{\sum_{i=1}^{n} t_i}. (1)
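As a quick numeric check of (1), with illustrative operation counts and timings (not measurements from any of the cited studies):

```python
# Average processing speed m over n iteration runs, per equation (1):
# total operations divided by total elapsed time. Sample values are
# illustrative only.
def average_processing_speed(ops, times):
    assert len(ops) == len(times)
    return sum(ops) / sum(times)

ops = [1200, 1500, 1300]      # memory/disk-intensive operations per run
times = [2.0, 2.5, 1.5]       # seconds per run
m = average_processing_speed(ops, times)
print(m)  # 4000 operations / 6.0 s, i.e. about 666.67 ops per second
```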

Veiga et al. [30] conducted a series of experiments on a multicore cluster setup to demonstrate performance results for Apache Hadoop, Spark, and Flink. Apache Spark and Flink turned out to be much more efficient execution platforms than Hadoop for nonsort benchmarks. It was further noted that Spark showed better performance for operations such as WordCount and K-Means (CPU-bound in nature), while Flink achieved better results for the PageRank algorithm (memory-bound in nature). Mavridis and Karatza [31] experimentally compared the performance of Apache Hadoop and Spark on the Okeanos IaaS cloud platform. For each set of experiments, the relevant statistics on execution time, working nodes, and dataset size were recorded. Spark performance was found to be optimal compared with Hadoop for most of the cases. Furthermore, Spark on the YARN platform showed suboptimal results compared with execution in standalone mode. Similar results were also observed by Zaharia et al. [32] on a 100 GB dataset record. Vellaipandiyan and Raja [33] demonstrated the performance evaluation and comparison of the Hadoop and Spark frameworks on residents' record datasets ranging from 100 GB to 900 GB in size. Spark's performance was relatively better when the dataset size was small to medium (100 GB–750 GB); afterwards, its performance declined compared with Hadoop. The primary reason for the performance decline was that the Spark cache could not fit into memory for the larger datasets. Taran et al. [34] quantified the performance differences of Hadoop and Spark using a WordCount dataset ranging from 100 KB to 1 GB. It was observed that the Hadoop framework was five times faster than Spark when the evaluation was performed on larger data sources, whereas for smaller tasks Spark showed better performance. However, the speed-up ratio decreased for both frameworks with the growth of the input dataset.

Gopalani and Arora [35] used the K-Means algorithm on small, medium, and large location datasets to compare the Hadoop and Spark frameworks. The study results showed that Spark performed up to three times better than MapReduce for most of the cases. Bertoni et al. [36] performed an experimental evaluation of Apache Flink and Storm using large genomic datasets on the Amazon EC2 cloud. Apache Flink was superior to Storm for histogram and map operations, while Storm outperformed Flink when a genomic join application was deployed.

3.2. Fault Tolerance. Fault tolerance is the characteristic that enables a system to continue functioning in case of the failure of one or more components. High-performance computing applications involve hundreds of nodes that are interconnected to perform a specific task; the failure of a node should have zero or minimal effect on the overall computation. The tolerance of a system, represented as Tol_{FT}, to meet its requirements after a disruption is the ratio of the time to complete tasks without observing any fault events to the overall execution time in which some fault events were detected and the system state was reverted to a consistent state:

Tol_{FT} = \frac{T_x}{T_x + \sigma^2}, (2)

where T_x is the estimated correct execution time, obtained from a program run that is presumed to be fault-free or by averaging the execution time over several application runs that produce a known correct output, and \sigma^2 represents the variance in a program's execution time due to the occurrence of fault events. For an HPC application that consists of a set of computationally intensive tasks \Gamma = \{\tau_1, \tau_2, \ldots, \tau_n\}, with Tol_{FT} computed for each individual task as Tol_{\tau_i}, the overall application resilience Tol may be calculated as [37]

Tol = \sqrt[n]{Tol_{\tau_1} \cdot Tol_{\tau_2} \cdots Tol_{\tau_n}}. (3)
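Equations (2) and (3) are straightforward to compute; the sketch below uses illustrative timing and variance values:

```python
# Equations (2)-(3): per-task fault tolerance and overall resilience
# as the geometric mean of per-task tolerances. Input values are
# illustrative, not measurements.
from math import prod

def tol_ft(t_x, sigma_sq):
    # Eq. (2): fault-free execution time over that time plus variance.
    return t_x / (t_x + sigma_sq)

def resilience(per_task):
    # Eq. (3): n-th root of the product of per-task tolerances.
    return prod(per_task) ** (1.0 / len(per_task))

tasks = [tol_ft(100.0, 5.0), tol_ft(80.0, 2.0), tol_ft(120.0, 10.0)]
print(round(resilience(tasks), 4))  # overall resilience, close to 1 is better
```

The geometric mean makes the aggregate sensitive to any single poorly tolerant task, which matches the intuition that one fragile stage undermines the whole pipeline.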

Lu et al. [38] used the StreamBench toolkit to evaluate the performance and fault tolerance ability of Apache Spark, Storm, and Samza. It was found that, with increased data size, Spark is much more stable and fault-tolerant than Storm but may be less efficient than Samza. Furthermore, when compared in terms of handling a large capacity of data, both Samza and Spark outperformed Storm. Gu and Li [39] used the PageRank algorithm to perform a comparative experiment on the Hadoop and Spark frameworks. It was observed that, for smaller datasets such as wiki-Vote and soc-Slashdot0902, Spark outperformed Hadoop by a considerable margin. However, this speed-up degraded with the growth of the dataset, and for large datasets Hadoop easily outperformed Spark. Furthermore, for massively large datasets, Spark was reported to crash with a JVM heap exception while Hadoop still completed its task. Lopez et al. [40] evaluated the throughput and fault tolerance mechanisms of Apache Storm and Flink. The experiments were based on a threat detection system, in which Apache Storm demonstrated better throughput than Flink. For fault tolerance,


different virtual machines were manually turned off to analyze the impact of node failures. Apache Flink used its internal subsystem to detect and migrate the failed tasks to other machines and hence resulted in very few message losses. On the other hand, Storm took more time, as Zookeeper, which involves some performance overhead, was responsible for reporting the state of nimbus and thereafter processing the failed task on other nodes.

3.3. Scalability. Scalability refers to the ability to accommodate large loads or changes in size/workload by provisioning resources at runtime. This can be further categorized as scale-up (making the hardware stronger) or scale-out (adding additional nodes). One of the critical requirements of enterprises is to process large volumes of data in a timely manner to address high-value business problems. Dynamic resource scalability allows business entities to perform massive computation in parallel, thus reducing overall time complexity and effort. The definition of scalability comes from Amdahl's and Gustafson's laws [41]. Let W be the size of the workload before the improvement of the system resources, let \alpha be the fraction of the execution workload that does not benefit from the improvement of system resources, and let 1 - \alpha be the fraction that does. When using an n-processor system, the user workload is scaled to

W' = \alpha W + (1 - \alpha) n W. (4)

The parallel execution time of the scaled workload on n processors yields the scaled-workload speed-up S, as shown in

S = \frac{W'}{W} = \frac{\alpha W + (1 - \alpha) n W}{W} = \alpha + (1 - \alpha) n. (5)
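Equations (4) and (5) can be checked numerically; the values below are illustrative:

```python
# Equations (4)-(5): Gustafson-style scaled workload and speed-up.
# alpha is the fraction of the workload that does not scale with n.
def scaled_workload(w, alpha, n):
    # Eq. (4): the fraction (1 - alpha) of the workload scales with n.
    return alpha * w + (1 - alpha) * n * w

def scaled_speedup(alpha, n):
    # Eq. (5): S = W'/W = alpha + (1 - alpha) * n.
    return alpha + (1 - alpha) * n

# Illustrative case: 10% of the workload does not scale, 16 processors.
print(scaled_speedup(0.1, 16))  # 0.1 + 0.9 * 16 = 14.5
```

Note how, unlike Amdahl's fixed-workload view, the scaled speed-up grows almost linearly in n, since the workload itself is allowed to grow with the cluster.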

Garcıa-Gil et al [42] performed scalability and performancecomparison of Apache Spark and Flink using feature selec-tion framework to assemble multiple information theoreticcriteria into a single greedy algorithm The ECBDL14 datasetwas used to measure scalability factor for the frameworksIt was observed that Spark scalability performance was 4ndash10times faster than Flink Jakovits and Srirama [43] analyzedfour MapReduce based frameworks including Hadoop andSpark for benchmarking partitioning aroundMedoids Clus-tering Large Applications and Conjugate Gradient linearsystem solver algorithms using MPI All experiments wereperformed on 33 Amazon EC2 large instances cloud Forall algorithms Spark performed much better as comparedto Hadoop in terms of both performance and scalabilityBoden et al [44] aimed to investigate the scalability withrespect to both data size and dimensionality in order todemonstrate a fair and insightful benchmark that reflects therequirements of real-world machine learning applicationsThe benchmark was comprised of distributed optimizationalgorithms for supervised learning as well as algorithms forunsupervised learning For supervised learning they imple-mented machine learning algorithms by using Breeze librarywhile for unsupervised learning they chose the popular 119896-Means clustering algorithm in order to assess the scalability

of the two resource management frameworks. The overall execution time of Flink was relatively low in resource-constrained settings with a limited number of nodes, while Spark had a clear edge once enough main memory was available through the addition of new computing nodes.

3.4. Machine Learning and Iterative Tasks Support. Big data applications are inherently complex in nature and usually involve tasks and algorithms that are iterative. These applications have a distinctly cyclic nature: they achieve the desired result by continually repeating a set of tasks until the residual error cannot be substantially reduced further.

Spangenberg et al. [45] used real-world datasets with four algorithms, namely WordCount, K-Means, PageRank, and a relational query, to benchmark Apache Flink and Storm. It was observed that Apache Storm performs better in batch mode as compared to Flink. However, with increasing complexity, Apache Flink had a performance advantage over Storm, and thus it was better suited for iterative data or graph processing. Shi et al. [46] focused on analyzing Apache MapReduce and Spark for batch and iterative jobs. For smaller datasets, Spark proved to be the better choice, but when experiments were performed on larger datasets, MapReduce turned out to be several times faster than Spark. For iterative operations such as K-Means, Spark turned out to be 1.5 times faster than MapReduce in its first iteration, while Spark was more than 5 times faster in subsequent operations.

Kang and Lee [47] examined five resource management frameworks, including Apache Hadoop and Spark, with respect to the performance overheads (disk input/output, network communication, scheduling, etc.) involved in supporting iterative computation. The PageRank algorithm was used to evaluate these performance issues. Since static data processing tends to be a more frequent operation than dynamic data processing, as the static data is used in every iteration of MapReduce, it may cause significant performance overhead in the case of MapReduce. On the other hand, Apache Spark uses read-only cached versions of objects (resilient distributed datasets) which can be reused in parallel operations, thus reducing the performance overhead during iterative computation. Lee et al. [48] evaluated five systems, including Hadoop and Spark, over various workloads to compare four iterative algorithms. The experimentation was performed on the Amazon EC2 cloud. Overall, Spark showed the best performance when iterative operations were performed in main memory. In contrast, the performance of Hadoop was significantly poorer compared to the other resource management systems.

3.5. Latency. Big data and low latency are strongly linked. Big data applications provide true value to businesses, but they are mostly time critical. If cloud computing is to be a successful platform for big data implementation, one of the key requirements will be the provisioning of high-speed networks to reduce communication latency. Furthermore, big data frameworks usually involve a centralized design where the scheduler assigns all tasks through a single node, which may significantly impact latency when the size of the data is huge.

Scientific Programming 11

Let T_elapsed be the elapsed time between the start and finish of a program in a distributed architecture, T_i be the effective execution time, and λ_i be the sum of total idle units of the i-th processor from a set of N processors. Then the average latency λ(W, N) for a workload of size W is defined as the average amount of overhead time needed for each processor to complete the task:

λ(W, N) = (∑_{i=1}^{N} (T_elapsed − T_i + λ_i)) / N. (6)
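Equation (6) averages, over the N processors, the wall-clock time not spent in effective execution plus the accumulated idle units; a direct transcription (with illustrative naming) is:

```python
def average_latency(t_elapsed, exec_times, idle_units):
    """Eq. (6): average overhead per processor, where t_elapsed is the
    program's wall-clock time, exec_times[i] is the effective execution
    time of processor i, and idle_units[i] its accumulated idle units."""
    n = len(exec_times)
    return sum(t_elapsed - t_i + lam_i
               for t_i, lam_i in zip(exec_times, idle_units)) / n

# Two processors finishing early relative to a 10 s elapsed run:
average_latency(10.0, [8.0, 6.0], [1.0, 1.0])  # -> 4.0
```

A perfectly balanced run, where every processor computes for the full elapsed time with no idling, gives an average latency of zero.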

Chintapalli et al. [49] conducted a detailed analysis of the Apache Storm, Flink, and Spark streaming engines for latency and throughput. The study results indicated that at high throughput, Flink and Storm have significantly lower latency than Spark. However, Spark was able to handle higher throughput than the other streaming engines. Lu et al. [38] proposed the StreamBench benchmark framework to evaluate modern distributed stream processing frameworks. The framework includes dataset selection, data generation methodologies, a program set description, workload suite design, and metric proposition. Two real-world datasets, AOL Search Data and the CAIDA Anonymized Internet Traces Dataset, were used to assess the performance, scalability, and fault tolerance aspects of streaming frameworks. It was observed that Storm's latency was in most cases far less than Spark's, except when the scales of the workload and dataset were massive in nature.

3.6. Security. In contrast to the classical HPC environment, where information is stored in-house, many big data applications are now increasingly deployed on the cloud, where privacy-sensitive information may be accessed or recorded by different data users with ease. Although data privacy and security issues are not a new topic in the area of distributed computing, their importance is amplified by the wide adoption of cloud computing services for big data platforms. A dataset may be exposed to multiple users for different purposes, which may lead to security and privacy risks.

Let N be a list of security categories that may be provided by a security mechanism. For instance, a framework may use an encryption mechanism to provide data security and access control lists for authentication and authorization services. Let max(W_i) be the maximum weight assigned to the i-th security category from the list of N categories, and let W_i be the reputation score of a particular resource management framework for that category. Then the framework security ranking score can be represented as

Sec_score = ∑_{i=1}^{N} W_i / ∑_{i=1}^{N} max(W_i). (7)
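Equation (7) is simply the ratio of the weights a framework earns to the maximum attainable weights over the N security categories; sketched in Python (the scores and weights below are made-up values, not taken from the study):

```python
def security_score(earned_weights, max_weights):
    """Eq. (7): framework security ranking score, i.e. the sum of the
    reputation scores W_i divided by the sum of the maximum weights
    max(W_i) over the N security categories."""
    return sum(earned_weights) / sum(max_weights)

# e.g. encryption scored 3 out of 5, access control 4 out of 5:
security_score([3, 4], [5, 5])  # -> 0.7
```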

Hadoop and Storm use the Kerberos authentication protocol for computing nodes to prove their identity [50]. Spark adopts a password-based shared secret configuration as well as Access Control Lists (ACLs) to control its authentication and authorization mechanisms. In Flink, stream brokers are responsible for providing the authentication mechanism across multiple services. Apache Samza/Kafka provides no built-in security at the system level.
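For illustration, Spark's shared-secret authentication and ACLs are enabled through configuration properties along these lines (a minimal sketch of a `spark-defaults.conf`; the exact options and their semantics should be checked against the Spark version in use):

```properties
# Require Spark components to authenticate with a shared secret
spark.authenticate          true
spark.authenticate.secret   <shared-secret>

# Enforce access control lists on the web UI and on job modification
spark.acls.enable           true
spark.ui.view.acls          alice,bob
spark.modify.acls           alice
```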

3.7. Dataset Size Support. Many scientific applications scale up to hundreds of nodes to process massive amounts of data that may exceed hundreds of terabytes. Unless big data applications are properly optimized for larger datasets, growth of the data may result in performance degradation. Furthermore, in many cases this may result in a crash of the software, resulting in loss of time and money. We used the same methodology to collect big data support statistics as presented in the earlier sections.

4. Discussion on Big Data Frameworks

Every big data framework has evolved for its unique design characteristics and application-specific requirements. Based on the key factors and the empirical evidence used to evaluate big data frameworks, as discussed in Section 3, we summarize the results of the seven evaluation factors in the form of a star ranking, Ranking_RF ∈ [0, s], where s represents the maximum evaluation score. The general equation to produce the ranking for each resource framework (RF) is given as

Ranking_RF = ‖ (s / ∑_{i=1}^{7} ∑_{j=1}^{N} max(W_ij)) ∗ ∑_{i=1}^{7} ∑_{j=1}^{N} W_ij ‖, (8)

where N is the total number of research studies for a particular set of evaluation metrics, max(W_ij) ∈ [0, 1] is the relative normalized weight assigned to each literature study based on the number of experiments performed, and W_ij is the framework test-bed score calculated from the experimentation results of each study.
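Read this way, (8) rescales the evidence-weighted score sum to the interval [0, s]; a direct transcription (with hypothetical weights, ignoring any rounding to whole stars) is:

```python
def framework_ranking(test_scores, max_weights, s=5):
    """Eq. (8): star ranking in [0, s] for a resource framework.
    test_scores[i][j] is the test-bed score W_ij of study j for
    evaluation factor i; max_weights[i][j] is that study's relative
    normalized maximum weight."""
    earned = sum(w for factor in test_scores for w in factor)
    attainable = sum(w for factor in max_weights for w in factor)
    return s * earned / attainable

# Two factors with two studies each; half the attainable weight earned:
framework_ranking([[1.0, 0.5], [0.5, 0.0]],
                  [[1.0, 1.0], [1.0, 1.0]])  # -> 2.5
```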

Hadoop MapReduce has a clear edge in large-scale deployment and larger dataset processing. Hadoop is highly compatible and interoperable with other frameworks. It also offers a reliable fault tolerance mechanism that provides failure-free operation over a long period of time, and it can operate on low-cost hardware. However, Hadoop is not suitable for real-time applications. It is at a significant disadvantage when latency, throughput, and iterative job support for machine learning are the key considerations of the application requirements.

Apache Spark is designed to be a replacement for the batch-oriented Hadoop ecosystem, running over both static and real-time datasets. It is highly suitable for high-throughput streaming applications where latency is not a major issue. Spark is memory intensive, and all operations take place in memory. As a result, it may crash if not enough memory is available for further operations (before the release of Spark version 1.5 it was not capable of handling datasets larger than the size of RAM, and the problem of handling larger datasets persists in the newer releases, with different performance overheads). A few research efforts, such as Project Tungsten, are aimed at improving the memory and CPU efficiency of Spark applications. Spark also lacks its own storage system, so its integration with HDFS through YARN, or with Cassandra using Mesos, is an extra overhead for cluster configuration.

Apache Flink is a true streaming engine. Flink supports both batch and real-time operations over a common runtime to fulfill the requirements of the Lambda architecture. However,


it may also work in batch mode by stopping the streaming source. Like Spark, Flink performs all operations in memory, but in the case of memory pressure it may also use disk storage to avoid application failure. Flink has some major advantages over Hadoop and Spark, providing better support for iterative processing with high throughput combined with low latency.

Apache Storm was designed to provide a scalable, fault-tolerant, real-time streaming engine for data analysis, much as Hadoop did for batch processing. However, the empirical evidence suggests that Apache Storm has proved inefficient at meeting the scale-up/scale-down requirements of real-time big data applications. Furthermore, since it uses micro-batch stream processing, it is not very efficient where continuous stream processing is a major concern, nor does it provide a mechanism for simple batch processing. For fault tolerance, Storm uses Zookeeper to store the state of the processes, which may involve some extra overhead and may also result in message loss. On the other hand, Storm is an ideal solution for near-real-time application processing, where the workload can be processed with minimal delay under strict latency requirements.

Apache Samza, in integration with Kafka, provides some unique features that are not offered by other stream processing engines. Samza provides a powerful checkpointing-based fault tolerance mechanism with minimal data loss. Samza jobs can achieve high throughput with low latency when integrated with Kafka. However, Samza lacks some important features as a data processing engine. Furthermore, it offers no built-in security mechanism for data access control.

To categorize the selection of the best resource engine based on a particular set of requirements, we use the framework proposed by Chung et al. [51]. The framework provides a layout for matching, ranking, and selecting a system based on particular requirements. The matching criterion is based on user goals, which are categorized as soft and hard goals. Soft goals represent the nonfunctional requirements of the system (such as security, fault tolerance, and scalability), while hard goals represent the functional aspects of the system (such as machine learning and data size support). The relationship between the system components and the goals can be ranked as very positive (++), positive (+), negative (−), and very negative (−−). Based on the evidence provided by the literature, the categorization of the major resource management frameworks is presented in Figure 7.

5. Related Work

Our research work differs from other efforts because the subject, goal, and object of study are not identical, as we provide an in-depth comparison of popular resource engines based on empirical evidence from the existing literature. Hesse and Lorenz [8] conducted a conceptual survey of stream processing systems. However, their discussion focused on some basic differences related to real-time data processing engines. Singh and Reddy [9] provided a thorough analysis of big data analytic platforms that included peer-to-peer networks, field-programmable gate arrays (FPGAs), the Apache Hadoop ecosystem, high-performance computing (HPC)

clusters, multicore CPUs, and graphics processing units (GPUs). Our case is different here, as we are particularly interested in big data processing engines. Finally, Landset et al. [10] focused on machine learning libraries and their evaluation based on ease of use, scalability, and extensibility. However, our work differs from theirs, as the primary focuses of the two studies are not identical.

Chen and Zhang [11] discussed big data problems and challenges and the associated techniques and technologies for addressing these issues. Several potential techniques, including cloud computing, quantum computing, granular computing, and biological computing, were investigated, and possible opportunities to explore these domains were demonstrated. However, the performance evaluation was discussed only on theoretical grounds. A taxonomy and detailed analysis of the state of the art in big data 2.0 processing systems were presented in [12]. The focus of the study was to identify current research challenges and highlight opportunities for new innovations and optimization for future research and development. Assuncao et al. [13] reviewed multiple generations of data stream processing frameworks that provide mechanisms for resource elasticity to match the demands of stream processing services. The study examined the challenges associated with efficient resource management decisions and suggested solutions derived from the existing research studies. However, the study metrics are restricted to the elasticity/scalability aspect of big data streaming frameworks.

As shown in Table 3, our work differs from the previous studies, which focused on the classification of resource management frameworks on theoretical grounds. In contrast to the earlier approaches, we classify and categorize big data resource management frameworks on empirical grounds, derived from multiple evaluation/experimentation studies. Furthermore, our evaluation/ranking methodology is based on a comprehensive list of study variables which were not addressed in the studies conducted earlier.

6. Conclusions and Future Work

There are a number of disruptive and transformative big data technologies and solutions that are rapidly emerging and evolving in order to provide data-driven insight and innovation. The primary object of this study was how to classify popular big data resource management systems. The study also aimed to address the selection of a candidate resource provider based on specific big data application requirements. We surveyed different big data resource management frameworks and investigated the advantages and disadvantages of each. We carried out a performance evaluation of the resource management engines based on seven key factors, and each of the frameworks was ranked based on the empirical evidence from the literature.

6.1. Observations and Findings. Some key findings of the study are as follows:

(i) In terms of processing speed, Apache Flink outperforms other resource management frameworks for small, medium, and large datasets [30, 36]. However, during our own set of experiments on an Amazon EC2


Table 3: Comparison and application areas of related research studies.

Study reference | Data model | Resource frameworks | Study features | Evaluation/ranking methodology
[8] | Data stream processing systems | Storm, Flink, Spark, and Samza | A brief comparison of resource frameworks | None
[9] | Batch and stream processing systems | Horizontal scaling systems such as peer-to-peer, MapReduce, MPI, and Spark, and vertical scaling systems such as CUDA and HDL | Comparison of horizontal and vertical scaling systems | Theoretical comparison of resource frameworks
[10] | Batch and stream processing engines | MapReduce, Spark, Flink, and Storm, as well as machine learning libraries | Machine learning libraries and their evaluation mechanisms | Performance comparison with respect to machine learning toolkits
[11] | Batch and stream processing frameworks | Hadoop, Storm, and other big data frameworks | In-depth analysis of big data opportunities and challenges | None
[12] | Batch and stream processing frameworks | Hadoop, Spark, Storm, Flink, and Tez, as well as SQL, graph, and bulk synchronous parallel models | Analysis of current open research challenges in the field of big data and promising directions for future research | None
[13] | Stream processing engines | Apache Storm, S4, Flink, Samza, Spark Streaming, and Twitter Heron | Classification of elasticity metrics for resource allocation strategies that meet the demands of stream processing services | Evaluation of elasticity scaling metrics for stream processing systems


[Figure 7: a goal-decomposition diagram relating Hadoop, Spark, Flink, Storm, and Samza to soft goals (fault tolerance/recoverability, scalability, performance, security via access control and authentication, latency) and hard goals (machine learning/iterative support, data size), with ratings from very positive (++) downwards.]

Figure 7: Categorization of resource engines based on key big data application requirements.

cluster with varied task manager settings (1–4 task managers per node), Flink failed to complete custom smaller-size JVM dataset jobs due to the inefficient memory management of the Flink memory manager. We could not find any reported evidence of this particular use case in the relevant literature. It seems that most of the performance evaluation studies employed standard benchmarking test sets in which the dataset size was relatively large, and hence this particular use case was not reported. Further research effort is required to elucidate the specific underlying factors in this case.

(ii) Big data applications usually involve massive amounts of data. Apache Spark supports the necessary strategies for a fault tolerance mechanism, but it has been reported to crash on larger datasets. Even during

our experimentation, Apache Spark version 1.6 (selected for compatibility with earlier research) crashed on several occasions when the dataset was larger than 500 GB. Although Spark has been ranked higher in terms of fault tolerance with increasing data scale in several studies [38, 52], it has limitations in handling typical larger big data applications, and hence such studies cannot be generalized.

(iii) Spark MLlib and Flink-ML offer a variety of machine learning algorithms and utilities to exploit distributed and scalable big data applications. Spark MLlib outperforms Flink-ML in most machine learning use cases [42], except when repeated passes are performed on unchanged data. However, this performance evaluation may be investigated further,


as research studies such as [53] reported differently, with Flink outperforming Spark on sufficiently large cases.

(iv) For graph processing algorithms such as PageRank, Flink uses the Gelly library, which provides native closed-loop iteration operators, making it a suitable platform for large-scale graph analytics. Spark, on the other hand, uses the GraphX library, which has a much longer preprocessing time to build the graph and other data structures, making its performance worse than Apache Flink's. Apache Flink has been reported to obtain the best results for graph processing datasets (3x–5x in [54] and 2x–3x in [45]) compared to Spark. However, some studies, such as [55], reported Spark to be 1.7x faster than Flink for large graph processing. Such inconsistent behavior may be investigated further in future research studies.

6.2. Future Work. In the area of big data, there is still a clear gap that requires more effort from the research community to build an in-depth understanding of the performance characteristics of big data resource management frameworks. We consider this study a step towards enlarging our knowledge of the big data world and an effort towards improving the state of the art and achieving the big vision for the big data domain. In earlier studies, a clear ranking could not be established, as the study parameters were mostly limited to a few issues such as throughput, latency, and machine learning. Furthermore, additional investigation is required into resource engines such as Apache Samza in comparison with other frameworks. In addition, research effort needs to be carried out in several areas, such as data organization, platform-specific tools, and technological issues in the big data domain, in order to create next-generation big data infrastructures. The performance evaluation factors might also vary among systems depending on the algorithms used. As future work, we plan to benchmark these popular resource engines for meeting the resource demands and requirements of different scientific applications. Moreover, a scalability analysis could be carried out; in particular, the performance evaluation of dynamically adding worker nodes and the resulting performance analysis is of particular interest. Additionally, further research can evaluate performance aspects with respect to resource competition between jobs (on different resource schedulers such as YARN and Mesos) and the fluctuation of available computing resources. Finally, most of the experiments in earlier studies were performed using standard parameter configurations; however, each resource management framework offers domain-specific tweaks and configuration optimization mechanisms for meeting application-specific requirements. The development of a benchmark suite that aims to find maximum throughput based on configuration optimization would be an interesting direction for future research.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] "Big Data in the Cloud: Converging Technologies" (August 11), https://www.intel.com/content/dam/www/public/emea/de/de/documents/product-briefs/big-data-cloud-technologies-brief.pdf.

[2] Q. He, J. Yan, R. Kowalczyk, H. Jin, and Y. Yang, "Lifetime service level agreement management with autonomous agents for services provision," Information Sciences, vol. 179, no. 15, pp. 2591–2605, 2009.

[3] O. Rana, M. Warnier, T. B. Quillinan, and F. Brazier, "Monitoring and reputation mechanisms for service level agreements," in 5th International Workshop on Grid Economics and Business Models, GECON '08, vol. 5206 of Lecture Notes in Computer Science, pp. 125–139, 2008.

[4] N. Yuhanna and M. Gualtieri, "The Forrester Wave: Big Data Hadoop Cloud Solutions, Q2 2016: Elasticity, Automation, and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud" (June 20), https://ncmedia.azureedge.net/ncmedia/2016/05/The_Forrester_Wave_Big_D.pdf.

[5] R. Arora, "An introduction to big data, high performance computing, high-throughput computing, and Hadoop," in Conquering Big Data with High Performance Computing, pp. 1–12, 2016.

[6] F. Pop, J. Kołodziej, and B. D. Martino, Resource Management for Big Data Platforms: Algorithms, Modelling, and High-Performance Computing Techniques, Springer, 2016.

[7] M. Gribaudo, M. Iacono, and F. Palmieri, "Performance modeling of big data-oriented architectures," in Resource Management for Big Data Platforms, Computer Communications and Networks, pp. 3–34, Springer International Publishing, Cham, 2016.

[8] G. Hesse and M. Lorenz, "Conceptual survey on data stream processing systems," in Proceedings of the 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), pp. 797–802, Melbourne, Australia, December 2015.

[9] D. Singh and C. K. Reddy, "A survey on platforms for big data analytics," Journal of Big Data, vol. 2, Article ID 8, 2014.

[10] S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin, "A survey of open source tools for machine learning with big data in the Hadoop ecosystem," Journal of Big Data, vol. 2, no. 1, Article ID 24, 2015.

[11] C. L. P. Chen and C. Y. Zhang, "Data-intensive applications, challenges, techniques and technologies: a survey on Big Data," Information Sciences, vol. 275, pp. 314–347, 2014.

[12] F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, and S. Sakr, "Big data 2.0 processing systems: taxonomy and open challenges," Journal of Grid Computing, vol. 14, no. 3, pp. 379–405, 2016.

[13] M. D. de Assuncao, A. da S. Veith, and R. Buyya, "Distributed data stream processing and edge computing: a survey on resource elasticity and future directions," Journal of Network and Computer Applications, vol. 103, pp. 1–17, 2018.

[14] "Big data working group: big data taxonomy, 2014."

[15] W. Inoubli, S. Aridhi, H. Mezni, M. Maddouri, and E. M. Nguifo, "An experimental survey on big data frameworks," Future Generation Computer Systems, 2018.

[16] H.-K. Kwon and K.-K. Seo, "A fuzzy AHP based multi-criteria decision-making model to select a cloud service," International Journal of Smart Home, vol. 8, no. 3, pp. 175–180, 2014.

[17] A. Rezgui and S. Rezgui, "A stochastic approach for virtual machine placement in volunteer cloud federations," in Proceedings of the 2nd IEEE International Conference on Cloud Engineering, IC2E 2014, pp. 277–282, Boston, MA, USA, March 2014.

[18] M. Salama, A. Zeid, A. Shawish, and X. Jiang, "A novel QoS-based framework for cloud computing service provider selection," International Journal of Cloud Applications and Computing, vol. 4, no. 2, pp. 48–72, 2014.

[19] G. Mateescu, W. Gentzsch, and C. J. Ribbens, "Hybrid computing - where HPC meets grid and cloud computing," Future Generation Computer Systems, vol. 27, no. 5, pp. 440–453, 2011.

[20] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: enabling scalable virtual organizations," International Journal of High Performance Computing Applications, vol. 15, no. 3, pp. 200–222, 2001.

[21] S. Benedict, "Performance issues and performance analysis tools for HPC cloud applications: a survey," Computing, vol. 95, no. 2, pp. 89–108, 2013.

[22] E. C. Inacio and M. A. R. Dantas, "A survey into performance and energy efficiency in HPC, cloud and big data environments," International Journal of Networking and Virtual Organisations, vol. 14, no. 4, pp. 299–318, 2014.

[23] N. Yuhanna and M. Gualtieri, "Elasticity, Automation, and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud, Q2 2016."

[24] S. Ullah, M. D. Awan, and M. S. Khiyal, "A price-performance analysis of EC2, Google Compute and Rackspace cloud providers for scientific computing," Journal of Mathematics and Computer Science, vol. 16, no. 2, pp. 178–192, 2016.

[25] J. Hurwitz, A. Nugent, F. Halper, and M. Kaufman, Big Data for Dummies, John Wiley & Sons, 2013.

[26] S. K. Garg, S. Versteeg, and R. Buyya, "A framework for ranking of cloud computing services," Future Generation Computer Systems, vol. 29, no. 4, pp. 1012–1023, 2013.

[27] D. Wu, S. Sakr, and L. Zhu, "Big data programming models," in Handbook of Big Data Technologies, pp. 31–63, 2017.

[28] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, vol. 1, no. 1, 2016.

[29] A. Li, X. Yang, S. Kandula, and M. Zhang, "CloudCmp: comparing public cloud providers," in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10), pp. 1–14, ACM, Melbourne, Australia, November 2010.

[30] J. Veiga, R. R. Exposito, X. C. Pardo, G. L. Taboada, and J. Touriño, "Performance evaluation of big data frameworks for large-scale data analytics," in Proceedings of the 4th IEEE International Conference on Big Data, Big Data 2016, pp. 424–431, December 2016.

[31] I. Mavridis and H. Karatza, "Log file analysis in cloud with Apache Hadoop and Apache Spark," in Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015), Poland, 2015.

[32] M. Zaharia, M. Chowdhury, and T. Das, "Fast and interactive analytics over Hadoop data with Spark," USENIX Login, vol. 37, no. 4, pp. 45–51, 2012.

[33] S. Vellaipandiyan and P. V. Raja, "Performance evaluation of distributed framework over YARN cluster manager," in Proceedings of the 2016 IEEE International Conference on Computational Intelligence and Computing Research, ICCIC 2016, December 2016.

[34] V. Taran, O. Alienin, S. Stirenko, Y. Gordienko, and A. Rojbi, "Performance evaluation of distributed computing environments with Hadoop and Spark frameworks," in Proceedings of the 2017 IEEE International Young Scientists' Forum on Applied Physics and Engineering (YSF), pp. 80–83, Lviv, Ukraine, October 2017.

[35] S. Gopalani and R. Arora, "Comparing Apache Spark and Map Reduce with performance analysis using K-means," International Journal of Computer Applications, vol. 113, no. 1, pp. 8–11, 2015.

[36] M. Bertoni, S. Ceri, A. Kaitoua, and P. Pinoli, "Evaluating cloud frameworks on genomic applications," in Proceedings of the 3rd IEEE International Conference on Big Data, IEEE Big Data 2015, pp. 193–202, November 2015.

[37] S. Hukerikar, R. A. Ashraf, and C. Engelmann, "Towards new metrics for high-performance computing resilience," in Proceedings of the 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017, pp. 23–30, June 2017.

[38] R. Lu, G. Wu, B. Xie, and J. Hu, "StreamBench: towards benchmarking modern distributed stream computing frameworks," in Proceedings of the 7th IEEE/ACM International Conference on Utility and Cloud Computing, UCC 2014, pp. 69–78, December 2014.

[39] L. Gu and H. Li, "Memory or time: performance evaluation for iterative operation on Hadoop and Spark," in Proceedings of the 15th IEEE International Conference on High Performance Computing and Communications, HPCC 2013, and the 11th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, EUC 2013, pp. 721–727, November 2013.

[40] M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte, "A performance comparison of open-source stream processing platforms," in Proceedings of the 59th IEEE Global Communications Conference, GLOBECOM 2016, December 2016.

[41] J. L. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, vol. 31, no. 5, pp. 532–533, 1988.

[42] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, "A comparison on scalability for batch big data processing on Apache Spark and Apache Flink," Big Data Analytics, vol. 2, no. 1, 2017.

[43] P. Jakovits and S. N. Srirama, "Evaluating MapReduce frameworks for iterative scientific computing applications," in Proceedings of the 2014 International Conference on High Performance Computing and Simulation, HPCS 2014, pp. 226–233, July 2014.

[44] C. Boden, A. Spina, T. Rabl, and V. Markl, "Benchmarking data flow systems for scalable machine learning," in Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2017, May 2017.

[45] N. Spangenberg, M. Roth, and B. Franczyk, Evaluating New Approaches of Big Data Analytics Frameworks, vol. 208 of Lecture Notes in Business Information Processing, Springer, Cham, 2015.

[46] J. Shi, Y. Qiu, U. F. Minhas, et al., "Clash of the titans: MapReduce vs. Spark for large scale data analytics," in Proceedings of the 41st International Conference on Very Large Data Bases, pp. 2110–2121, Kohala Coast, HI, USA, 2015.

[47] M. Kang and J. Lee, "A comparative analysis of iterative MapReduce systems," in Proceedings of the Sixth International Conference, pp. 61–64, Jeju, Republic of Korea, October 2016.

[48] H. Lee, M. Kang, S.-B. Youn, J.-G. Lee, and Y. Kwon, "An experimental comparison of iterative MapReduce frameworks," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, pp. 2089–2094, October 2016.

[49] S. Chintapalli, D. Dagit, B. Evans, et al., "Benchmarking streaming computation engines: Storm, Flink and Spark streaming," in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016, pp. 1789–1792, May 2016.

[50] X. Zhang, C. Liu, S. Nepal, W. Dou, and J. Chen, "Privacy-preserving layer over MapReduce on cloud," in Proceedings of the 2nd International Conference on Cloud and Green Computing, CGC 2012, held jointly with the 2nd International Conference on Social Computing and Its Applications, SCA 2012, pp. 304–310, November 2012.

[51] L. Chung, B. A. Nixon, and E. Yu, "Using nonfunctional requirements to systematically select among alternatives in architectural design," in Proceedings of the 1st International Workshop on Architectures for Software Systems, pp. 31–43, 1995.

[52] S. Qian, G. Wu, J. Huang, and T. Das, "Benchmarking modern distributed streaming platforms," in Proceedings of the 2016 IEEE International Conference on Industrial Technology (ICIT), pp. 592–598, Taipei, Taiwan, March 2016.

[53] F. Dambreville and A. Toumi, "Generic and massively concurrent computation of belief combination rules," in Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, pp. 1–6, Blagoevgrad, Bulgaria, November 2016.

[54] R. W. Techentin, M. W. Markland, R. J. Poole, D. R. Holmes, C. R. Haider, and B. K. Gilbert, "Page rank performance evaluation of cluster computing frameworks on Cray Urika-GX supercomputer," Computer Science and Engineering, vol. 6, pp. 33–38, 2016.

[55] O.-C. Marcu, A. Costan, G. Antoniu, and M. S. Pérez-Hernández, "Spark versus Flink: understanding performance in big data analytics frameworks," in Proceedings of the 2016 IEEE International Conference on Cluster Computing, CLUSTER 2016, pp. 433–442, Taipei, Taiwan, September 2016.


Scientific Programming 7

Table 1: Comparison of big data frameworks.

Current stable version: Hadoop 2.8.1; Spark 2.2.0; Storm 1.1.1; Samza 0.13.0; Flink 1.3.2.
Batch processing: Hadoop yes; Spark yes; Storm yes; Samza no; Flink yes.
Computational model: Hadoop MapReduce; Spark streaming (microbatches); Storm streaming (microbatches); Samza streaming; Flink supports continuous flow streaming as well as microbatch and batch.
Data flow: Hadoop chain of stages; Spark directed acyclic graph; Storm directed acyclic graphs (DAGs) with spouts and bolts; Samza streams (acyclic graph); Flink controlled cyclic dependency graph through machine learning.
Resource management: Hadoop YARN; Spark YARN, Mesos; Storm HDFS (YARN), Mesos; Samza YARN, Mesos; Flink Zookeeper, YARN, Mesos.
Language support: Hadoop all major languages; Spark Java, Scala, Python, and R; Storm any programming language; Samza JVM languages; Flink Java, Scala, Python, and R.
Job management optimization: Hadoop MapReduce approach; Spark Catalyst extension; Storm Storm-YARN and 3rd-party tools like Ganglia; Samza internal JobRunner; Flink internal optimizer.
Interactive mode: Hadoop none (3rd-party tools like Impala can be integrated); Spark interactive shell; Storm none; Samza limited API of Kafka streams; Flink Scala shell.
Machine learning libraries: Hadoop Apache Mahout, H2O; Spark SparkML and MLlib; Storm Trident-ML, Apache SAMOA; Samza Apache SAMOA; Flink Flink-ML.
Maximum reported nodes (scalability): Hadoop Yahoo Hadoop cluster with 42,000 nodes; Spark 8,000 nodes; Storm 300 nodes; Samza LinkedIn clusters with around a hundred nodes; Flink Alibaba customized Flink cluster with thousands of nodes.


Table 2: Comparative analysis of big data resource frameworks (s = 5).

Processing speed: Hadoop ★★★; Spark ★★★★; Flink ★★★★★; Storm ★★★★; Samza ★★★★.
Fault tolerance: Hadoop ★★★★; Spark ★★; Flink ★★★★; Storm ★★★; Samza ★★★★.
Scalability: Hadoop ★★★★★; Spark ★★★★; Flink ★★★; Storm ★★★; Samza ★★★.
Machine learning: Hadoop ★★; Spark ★★★★★; Flink ★★★★; Storm ★★★; Samza ★★★★.
Low latency: Hadoop ★★; Spark ★★★; Flink ★★★; Storm ★★★★; Samza ★★★★.
Security: Hadoop ★★★★; Spark ★★★★★; Flink ★★★★; Storm ★★★★; Samza N.
Dataset size: Hadoop ★★★★★; Spark ★★★; Flink ★★★★; Storm ★★★; Samza ★★★.

Figure 5: Architecture of a Storm cluster: a Nimbus master node coordinates, through the Zookeeper framework, supervisor-managed worker nodes whose worker processes run executors and tasks; source: [17].

Figure 6: Samza architecture: Kafka topics feed a Samza job whose containers and tasks run under the YARN NodeManager on each machine; source: [18].

proceeds with the output from the previous stage and generates an input for the next stage. Although machine learning algorithms are mostly designed in the form of cyclic data flows, Spark, Storm, and Samza represent them as directed acyclic graphs to optimize the execution plan. Flink supports a controlled cyclic dependency graph at runtime to represent machine learning algorithms in a very efficient way.

Hadoop and Storm do not provide any default interactive environment. Apache Spark has a command-line interactive shell to use the application features. Flink provides a Scala shell to configure standalone as well as cluster setups. Apache Hadoop is highly scalable, and it has been used in Yahoo production clusters consisting of 42,000 nodes across 20 YARN clusters. The largest known cluster size for Spark is 8,000 computing nodes, while Storm has been tested on a maximum of 300-node clusters. An Apache Samza cluster with around a hundred nodes has been used in the LinkedIn data flow and application messaging system. Apache Flink has been customized for the Alibaba search engine with a deployment capacity of thousands of processing nodes.

3. Comparative Evaluation of Big Data Frameworks

Big data in cloud computing, a popular research trend, is posing significant influence on current enterprises, IT industries, and research communities. There are a number of disruptive and transformative big data technologies and solutions that are rapidly emanating and evolving in order to provide data-driven insight and innovation. Furthermore, modern cloud computing services offer all kinds of big data analytic tools, technologies, and computing infrastructures to speed up the data analysis process at an affordable cost. Although many distributed resource management frameworks are available nowadays, the main issue is how to select a suitable big data framework. The selection of one big data platform over the others comes down to the specific application requirements and constraints, which may involve several tradeoffs and application usage scenarios. However, we can identify some key factors that need to be fulfilled before deploying a big data application in the cloud. In this section, based on empirical evidence from the available literature, we discuss the advantages and disadvantages of each resource management framework.

3.1. Processing Speed. Processing speed is an important performance measurement that may be used to evaluate the effectiveness of different resource management frameworks. It is a common metric for the maximum number of I/O operations to disk or memory, or the data transfer rate between the computational units of the cluster, over a specific amount of time. In the context of big data, the average processing speed m, calculated after n iteration runs, is the maximum amount of memory/disk-intensive operations that can be performed over time intervals t_i:

\[ m = \frac{\sum_{i=1}^{n} m_i}{\sum_{i=1}^{n} t_i} \tag{1} \]
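As a worked example, (1) can be evaluated directly from per-iteration measurements. The function name and the sample figures below are illustrative, not taken from any of the cited studies:

```python
def avg_processing_speed(ops, times):
    """Average processing speed m from (1): total memory/disk-intensive
    operations across n iteration runs divided by total elapsed time."""
    if len(ops) != len(times) or not ops:
        raise ValueError("need one (m_i, t_i) pair per iteration run")
    return sum(ops) / sum(times)

# Three iteration runs: operations completed and elapsed seconds per run.
m = avg_processing_speed([1200, 1500, 1300], [2.0, 2.5, 1.5])
print(m)  # 4000 ops over 6.0 s, roughly 666.7 ops/s
```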

Veiga et al. [30] conducted a series of experiments on a multicore cluster setup to demonstrate performance results of Apache Hadoop, Spark, and Flink. Apache Spark and Flink turned out to be much more efficient execution platforms than Hadoop when performing nonsort benchmarks. It was further noted that Spark showed better performance results for operations such as WordCount and K-Means (CPU-bound in nature), while Flink achieved better results for the PageRank algorithm (memory-bound in nature). Mavridis and Karatza [31] experimentally compared performance statistics of Apache Hadoop and Spark on the Okeanos IaaS cloud platform. For each set of experiments, necessary statistics related to execution time, working nodes, and dataset size were recorded. Spark performance was found to be optimal compared to Hadoop for most of the cases. Furthermore, Spark on the YARN platform showed suboptimal results compared to the case when it was executed in standalone mode. Similar results were also observed by Zaharia et al. [32] on a 100 GB dataset record. Vellaipandiyan and Raja [33] demonstrated performance evaluation and comparison of the Hadoop and Spark frameworks on residents' record datasets ranging from 100 GB to 900 GB in size. Spark's scale of performance was relatively better when the dataset was of small to medium size (100 GB–750 GB); afterwards its performance declined compared to Hadoop. The primary reason for the performance decline was that the Spark cache could not fit into memory for the larger datasets. Taran et al. [34] quantified performance differences of Hadoop and Spark using a WordCount dataset ranging from 100 KB to 1 GB. It was observed that the Hadoop framework was five times faster than Spark when the evaluation was performed using a larger set of data sources. However, for the smaller tasks, Spark showed better performance results, though the speed-up ratio decreased for both frameworks with the growth of the input dataset.

Gopalani and Arora [35] used the K-Means algorithm on some small, medium, and large location datasets to compare the Hadoop and Spark frameworks. The study results showed that Spark performed up to three times better than MapReduce for most of the cases. Bertoni et al. [36] performed an experimental evaluation of Apache Flink and Storm using a large genomic dataset on the Amazon EC2 cloud. Apache Flink was superior to Storm for histogram and map operations, while Storm outperformed Flink when the genomic join application was deployed.

3.2. Fault Tolerance. Fault tolerance is the characteristic that enables a system to continue functioning in case of the failure of one or more components. High-performance computing applications involve hundreds of nodes that are interconnected to perform a specific task; failure of a node should have zero or minimal effect on the overall computation. The tolerance of a system, represented as Tol_FT, to meet its requirements after a disruption is the ratio of the time to complete tasks without observing any fault events to the overall execution time where some fault events were detected and the system state was reverted back to a consistent state:

\[ \mathrm{Tol}_{\mathrm{FT}} = \frac{T_x}{T_x + \sigma^2} \tag{2} \]

where T_x is the estimated correct execution time, obtained from a program run that is presumed to be fault-free or by averaging the execution time from several application runs that produce a known correct output, and σ² represents the variance in a program's execution time due to the occurrence of fault events. For an HPC application that consists of a set of computationally intensive tasks Γ = {τ_1, τ_2, ..., τ_n}, with Tol_FT computed for each individual task τ_i as Tol_{τ_i}, the overall application resilience Tol may be calculated as [37]

\[ \mathrm{Tol} = \sqrt[n]{\mathrm{Tol}_{\tau_1} \cdot \mathrm{Tol}_{\tau_2} \cdots \mathrm{Tol}_{\tau_n}} \tag{3} \]
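Equations (2) and (3) compose naturally: a per-task tolerance, then a geometric mean over tasks. A minimal sketch with illustrative numbers (the function names are ours, not from [37]):

```python
import math

def tol_ft(t_x, sigma_sq):
    """Per-task tolerance (2): fault-free execution time T_x over the
    execution time inflated by the variance sigma^2 due to fault events."""
    return t_x / (t_x + sigma_sq)

def app_tolerance(task_tols):
    """Overall application resilience (3): geometric mean of the
    per-task tolerances Tol_tau_i."""
    n = len(task_tols)
    return math.prod(task_tols) ** (1.0 / n)

# Three tasks, each with T_x = 100 s and variance 25 s^2 from faults.
tols = [tol_ft(100.0, 25.0) for _ in range(3)]
print(app_tolerance(tols))  # geometric mean of three 0.8 values, i.e. 0.8
```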

Lu et al. [38] used the StreamBench toolkit to evaluate the performance and fault tolerance ability of Apache Spark, Storm, and Samza. It was found that with increased data size Spark is much more stable and fault-tolerant than Storm, but may be less efficient when compared to Samza. Furthermore, when compared in terms of handling a large capacity of data, both Samza and Spark outperformed Storm. Gu and Li [39] used the PageRank algorithm to perform a comparative experiment on the Hadoop and Spark frameworks. It was observed that for smaller datasets, such as wiki-Vote and soc-Slashdot0902, Spark outperformed Hadoop by a wide margin. However, this speed-up degraded with the growth of the dataset, and for large datasets Hadoop easily outperformed Spark. Furthermore, for massively large datasets Spark was reported to crash with a JVM heap exception while Hadoop still completed its task. Lopez et al. [40] evaluated the throughput and fault tolerance mechanisms of Apache Storm and Flink. The experiments were based on a threat detection system, where Apache Storm demonstrated better throughput than Flink. For fault tolerance, different virtual machines were manually turned off to analyze the impact of node failures. Apache Flink used its internal subsystem to detect and migrate the failed tasks to other machines and hence resulted in very few message losses. On the other hand, Storm took more time, as Zookeeper, involving some performance overhead, was responsible for reporting the state of Nimbus and thereafter processing the failed task on other nodes.

3.3. Scalability. Scalability refers to the ability to accommodate large loads or changes in size/workload by provisioning resources at runtime. This can be further categorized as scaling up (making the hardware stronger) or scaling out (adding additional nodes). One of the critical requirements of enterprises is to process large volumes of data in a timely manner to address high-value business problems. Dynamic resource scalability allows business entities to perform massive computation in parallel, thus reducing overall time complexity and effort. The definition of scalability comes from Amdahl's and Gustafson's laws [41]. Let W be the size of the workload before the improvement of the system resources, let α be the fraction of the execution workload that must run serially and so does not benefit from the improvement of system resources, and let 1 − α be the fraction that does benefit. When using an n-processor system, the user workload is scaled to

\[ W' = \alpha W + (1 - \alpha)\, n W \tag{4} \]

The speed-up obtained by executing the scaled workload in parallel on n processors is defined as the scaled-workload speed-up S, shown in

\[ S = \frac{W'}{W} = \frac{\alpha W + (1 - \alpha)\, n W}{W} = \alpha + (1 - \alpha)\, n \tag{5} \]
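Equation (5) reduces to a one-line calculation; the sketch below uses illustrative numbers (5% serial fraction, 64 workers):

```python
def scaled_speedup(alpha, n):
    """Scaled-workload speed-up S from (4)-(5):
    S = (alpha*W + (1 - alpha)*n*W) / W = alpha + (1 - alpha)*n,
    where alpha is the serial (non-scaling) fraction of the workload."""
    return alpha + (1.0 - alpha) * n

# 5% of the workload is inherently serial; 64 processors available.
print(scaled_speedup(0.05, 64))  # roughly 60.85
```

A fully parallelizable workload (alpha = 0) recovers the ideal linear speed-up S = n.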

García-Gil et al. [42] performed a scalability and performance comparison of Apache Spark and Flink using a feature selection framework that assembles multiple information-theoretic criteria into a single greedy algorithm. The ECBDL14 dataset was used to measure the scalability factor of the frameworks. It was observed that Spark's scalability performance was 4–10 times faster than Flink's. Jakovits and Srirama [43] analyzed four MapReduce-based frameworks, including Hadoop and Spark, for benchmarking Partitioning Around Medoids, Clustering Large Applications, and Conjugate Gradient linear system solver algorithms using MPI. All experiments were performed on a cloud of 33 Amazon EC2 large instances. For all algorithms, Spark performed much better than Hadoop in terms of both performance and scalability. Boden et al. [44] aimed to investigate scalability with respect to both data size and dimensionality in order to demonstrate a fair and insightful benchmark that reflects the requirements of real-world machine learning applications. The benchmark comprised distributed optimization algorithms for supervised learning as well as algorithms for unsupervised learning. For supervised learning, they implemented machine learning algorithms using the Breeze library, while for unsupervised learning they chose the popular k-Means clustering algorithm in order to assess the scalability of the two resource management frameworks. The overall execution time of Flink was relatively low in resource-constrained settings with a limited number of nodes, while Spark had a clear edge once enough main memory was available due to the addition of new computing nodes.

3.4. Machine Learning and Iterative Tasks Support. Big data applications are inherently complex in nature and usually involve tasks and algorithms that are iterative in nature. These applications have a distinct cyclic nature: they achieve the desired result by continually repeating a set of tasks until the remaining error cannot be substantially reduced further.

Spangenberg et al. [45] used real-world datasets, consisting of four algorithms, that is, WordCount, K-Means, PageRank, and a relational query, to benchmark Apache Flink and Storm. It was observed that Apache Storm performs better in batch mode compared to Flink. However, with increasing complexity, Apache Flink had a performance advantage over Storm, and thus it was better suited for iterative data or graph processing. Shi et al. [46] focused on analyzing Apache MapReduce and Spark for batch and iterative jobs. For smaller datasets, Spark turned out to be the better choice, but when experiments were performed on larger datasets, MapReduce was several times faster than Spark. For iterative operations such as K-Means, Spark was 1.5 times faster than MapReduce in its first iteration, while it was more than 5 times faster in subsequent operations.

Kang and Lee [47] examined five resource management frameworks, including Apache Hadoop and Spark, with respect to performance overheads (disk input/output, network communication, scheduling, etc.) in supporting iterative computation. The PageRank algorithm was used to evaluate these performance issues. Since static data processing tends to be a more frequent operation than dynamic data processing, as static data is used in every iteration of MapReduce, it may cause significant performance overhead in the case of MapReduce. On the other hand, Apache Spark uses read-only cached versions of objects (resilient distributed datasets) which can be reused in parallel operations, thus reducing the performance overhead during iterative computation. Lee et al. [48] evaluated five systems, including Hadoop and Spark, over various workloads to compare four iterative algorithms. The experimentation was performed on the Amazon EC2 cloud. Overall, Spark showed the best performance when iterative operations were performed in main memory. In contrast, the performance of Hadoop was significantly poorer than that of the other resource management systems.
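The caching effect discussed above can be illustrated with a toy sketch in plain Python (this is not the Spark API): loading the static input is paid once when it is cached, but once per iteration when it is recomputed, which is the overhead MapReduce incurs on iterative jobs:

```python
# Count how many times the static dataset is (re)loaded per strategy.
loads = {"recompute": 0, "cached": 0}

def load_static_data(mode):
    """Stand-in for reading the static input (e.g., a link graph)."""
    loads[mode] += 1
    return list(range(5))

def iterate(mode, iterations):
    """Run a fixed number of iterations, either caching the static
    data once up front or re-reading it on every iteration."""
    cached = load_static_data(mode) if mode == "cached" else None
    for _ in range(iterations):
        data = cached if cached is not None else load_static_data(mode)
        _ = sum(data)  # stand-in for one iteration's computation

iterate("recompute", 10)
iterate("cached", 10)
print(loads)  # {'recompute': 10, 'cached': 1}
```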

3.5. Latency. Big data and low latency are strongly linked. Big data applications provide true value to businesses, but they are mostly time critical. If cloud computing is to be the successful platform for big data implementation, one of the key requirements will be the provisioning of high-speed networks to reduce communication latency. Furthermore, big data frameworks usually involve a centralized design where the scheduler assigns all tasks through a single node, which may significantly impact the latency when the size of the data is huge.


Let T_elapsed be the elapsed time between the start and finish time of a program in a distributed architecture, T_i be the effective execution time, and λ_i be the sum of total idle units of the i-th processor from a set of N processors. Then the average latency λ(W, N) for a workload of size W is defined as the average amount of overhead time needed for each processor to complete the task:

\[ \lambda(W, N) = \frac{\sum_{i=1}^{N} \left( T_{\mathrm{elapsed}} - T_i + \lambda_i \right)}{N} \tag{6} \]

Chintapalli et al. [49] conducted a detailed analysis of the Apache Storm, Flink, and Spark streaming engines for latency and throughput. The study results indicated that at high throughput, Flink and Storm have significantly lower latency than Spark. However, Spark was able to handle higher throughput than the other streaming engines. Lu et al. [38] proposed the StreamBench benchmark framework to evaluate modern distributed stream processing frameworks. The framework includes dataset selection, data generation methodologies, program set description, workload suites design, and metric proposition. Two real-world datasets, AOL Search Data and the CAIDA Anonymized Internet Traces Dataset, were used to assess the performance, scalability, and fault tolerance aspects of streaming frameworks. It was observed that Storm's latency in most cases was far less than Spark's, except when the scales of the workload and dataset were massive in nature.

3.6. Security. In contrast to the classical HPC environment, where information is stored in-house, many big data applications are now increasingly deployed on the cloud, where privacy-sensitive information may be accessed or recorded by different data users with ease. Although data privacy and security issues are not a new topic in the area of distributed computing, their importance is amplified by the wide adoption of cloud computing services for big data platforms. A dataset may be exposed to multiple users for different purposes, which may lead to security and privacy risks.

Let N be a list of security categories that may be provided by a security mechanism. For instance, a framework may use an encryption mechanism to provide data security and access control lists for authentication and authorization services. Let max(W_i) be the maximum weight that is assigned to the i-th security category from the list of N categories and W_i be the reputation score of a particular resource management framework. Then the framework security ranking score can be represented as

\[ \mathrm{Sec}_{\mathrm{score}} = \frac{\sum_{i=1}^{N} W_i}{\sum_{i=1}^{N} \max(W_i)} \tag{7} \]

Hadoop and Storm use the Kerberos authentication protocol for computing nodes to prove their identity [50]. Spark adopts a password-based shared secret configuration as well as Access Control Lists (ACLs) to control the authentication and authorization mechanisms. In Flink, stream brokers are responsible for providing the authentication mechanism across multiple services. Apache Samza, with Kafka, provides no built-in security at the system level.

3.7. Dataset Size Support. Many scientific applications scale up to hundreds of nodes to process massive amounts of data that may exceed hundreds of terabytes. Unless big data applications are properly optimized for larger datasets, performance may degrade as the data grows. Furthermore, in many cases this may result in a crash of the software, causing loss of time and money. We used the same methodology to collect big data support statistics as presented in the earlier sections.

4. Discussion on Big Data Frameworks

Every big data framework has evolved for its unique design characteristics and application-specific requirements. Based on the key factors and their empirical evidence used to evaluate big data frameworks, as discussed in Section 3, we produce a summary of results for the seven evaluation factors in the form of a star ranking, Ranking_RF ∈ [0, s], where s represents the maximum evaluation score. The general equation to produce the ranking for each resource framework (RF) is given as

\[ \mathrm{Ranking}_{\mathrm{RF}} = \left\| \frac{s}{\sum_{i=1}^{7} \sum_{j=1}^{N} \max(W_{ij})} \cdot \sum_{i=1}^{7} \sum_{j=1}^{N} W_{ij} \right\| \tag{8} \]

where N is the total number of research studies for a particular set of evaluation metrics, max(W_ij) ∈ [0, 1] is the relative normalized weight assigned to each literature study based on the number of experiments performed, and W_ij is the framework test-bed score calculated from the experimentation results of each study.
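The ranking in (8) is a weighted normalization scaled to s stars; a minimal sketch of the arithmetic, omitting the final rounding denoted by the norm bars, with illustrative scores (not the study's actual data):

```python
def framework_ranking(s, scores, max_weights):
    """Star ranking (8), before rounding: test-bed scores W_ij summed
    over all seven metrics and N studies, normalized by the maximum
    attainable weight and scaled to the top score s."""
    return s * sum(scores) / sum(max_weights)

# Two studies per metric across seven metrics (14 entries), each study
# normalized to a maximum weight of 1.0; scores are illustrative.
w = [0.8] * 14
print(framework_ranking(5, w, [1.0] * 14))  # roughly 4.0 stars
```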

Hadoop MapReduce has a clear edge for large-scale deployment and larger dataset processing. Hadoop is highly compatible and interoperable with other frameworks. It also offers a reliable fault tolerance mechanism to provide failure-free operation over a long period of time. Hadoop can operate on low-cost configurations. However, Hadoop is not suitable for real-time applications. It has a significant disadvantage when latency, throughput, and iterative job support for machine learning are the key considerations of the application requirements.

Apache Spark is designed to be a replacement for the batch-oriented Hadoop ecosystem, running over static and real-time datasets. It is highly suitable for high-throughput streaming applications where latency is not a major issue. Spark is memory intensive, and all operations take place in memory. As a result, it may crash if enough memory is not available for further operations (before the release of Spark version 1.5, it was not capable of handling datasets larger than the size of RAM, and the problem of handling larger datasets still persists in the newer releases with different performance overheads). A few research efforts, such as Project Tungsten, are aimed at addressing the efficiency of memory and CPU for Spark applications. Spark also lacks its own storage system, so its integration with HDFS through YARN, or with Cassandra using Mesos, is an extra overhead for cluster configuration.

Apache Flink is a true streaming engine. Flink supports both batch and real-time operations over a common runtime to fulfill the requirements of the Lambda architecture. However, it may also work in batch mode by stopping the streaming source. Like Spark, Flink performs all operations in memory, but in case of memory pressure it may also use disk storage to avoid application failure. Flink has some major advantages over Hadoop and Spark by providing better support for iterative processing with high throughput and low latency.

Apache Storm was designed to provide a scalable, fault-tolerant, real-time streaming engine for data analysis, doing for real-time processing what Hadoop did for batch processing. However, the empirical evidence suggests that Apache Storm has proved inefficient in meeting the scale-up/scale-down requirements of real-time big data applications. Furthermore, since it uses microbatch stream processing, it is not very efficient where continuous stream processing is a major concern, nor does it provide a mechanism for simple batch processing. For fault tolerance, Storm uses Zookeeper to store the state of the processes, which may involve some extra overhead and may also result in message loss. On the other hand, Storm is an ideal solution for near-real-time application processing where the workload must be processed with minimal delay under strict latency requirements.

Apache Samza, in integration with Kafka, provides some unique features that are not offered by other stream processing engines. Samza provides a powerful checkpointing-based fault tolerance mechanism with minimal data loss. Samza jobs can have high throughput with low latency when integrated with Kafka. However, Samza lacks some important features as a data processing engine. Furthermore, it offers no built-in security mechanism for data access control.

To categorize the selection of the best resource engine based on a particular set of requirements, we use the framework proposed by Chung et al. [51]. The framework provides a layout for matching, ranking, and selecting a system based on particular requirements. The matching criterion is based on user goals, which are categorized as soft and hard goals. Soft goals represent the nonfunctional requirements of the system (such as security, fault tolerance, and scalability), while hard goals represent the functional aspects of the system (such as machine learning and data size support). The relationship between the system components and the goals can be ranked as very positive (++), positive (+), negative (−), and very negative (−−). Based on the evidence provided in the literature, the categorization of the major resource management frameworks is presented in Figure 7.
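The goal-matching idea can be sketched programmatically: encode each framework's fit to the evaluation goals and pick the framework that maximizes the goals a deployment cares about. The numeric ratings below mirror the star scores of Table 2 (Samza omitted for brevity) and the goal keys are our own illustrative names, not part of Chung et al.'s framework [51]:

```python
# Star ratings per framework and goal, mirroring Table 2 (illustrative).
RATINGS = {
    "Hadoop": {"speed": 3, "fault_tolerance": 4, "scalability": 5,
               "machine_learning": 2, "low_latency": 2,
               "security": 4, "dataset_size": 5},
    "Spark":  {"speed": 4, "fault_tolerance": 2, "scalability": 4,
               "machine_learning": 5, "low_latency": 3,
               "security": 5, "dataset_size": 3},
    "Flink":  {"speed": 5, "fault_tolerance": 4, "scalability": 3,
               "machine_learning": 4, "low_latency": 3,
               "security": 4, "dataset_size": 4},
    "Storm":  {"speed": 4, "fault_tolerance": 3, "scalability": 3,
               "machine_learning": 3, "low_latency": 4,
               "security": 4, "dataset_size": 3},
}

def best_framework(goals):
    """Return the framework with the highest summed rating on the goals."""
    return max(RATINGS, key=lambda f: sum(RATINGS[f][g] for g in goals))

print(best_framework(["low_latency"]))                 # Storm
print(best_framework(["dataset_size", "scalability"])) # Hadoop
```

The picks match the discussion above: Storm for near-real-time latency requirements, Hadoop for large-scale batch workloads.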

5. Related Work

Our research work differs from other efforts because the subject, goal, and object of study are not identical: we provide an in-depth comparison of popular resource engines based on empirical evidence from the existing literature. Hesse and Lorenz [8] conducted a conceptual survey of stream processing systems. However, their discussion was focused on some basic differences related to real-time data processing engines. Singh and Reddy [9] provided a thorough analysis of big data analytic platforms that included peer-to-peer networks, field-programmable gate arrays (FPGAs), the Apache Hadoop ecosystem, high-performance computing (HPC) clusters, multicore CPUs, and graphics processing units (GPUs). Our case is different here, as we are particularly interested in big data processing engines. Finally, Landset et al. [10] focused on machine learning libraries and their evaluation based on ease of use, scalability, and extensibility. However, our work differs from theirs, as the primary focuses of the two studies are not identical.

Chen and Zhang [11] discussed big data problems and challenges along with the associated techniques and technologies to address these issues. Several potential techniques, including cloud computing, quantum computing, granular computing, and biological computing, were investigated, and possible opportunities to explore these domains were demonstrated. However, the performance evaluation was discussed only on theoretical grounds. A taxonomy and detailed analysis of the state of the art in big data 2.0 processing systems were presented in [12]. The focus of the study was to identify current research challenges and highlight opportunities for new innovations and optimizations for future research and development. Assunção et al. [13] reviewed multiple generations of data stream processing frameworks that provide mechanisms for resource elasticity to match the demands of stream processing services. The study examined the challenges associated with efficient resource management decisions and suggested solutions derived from the existing research studies. However, the study metrics are restricted to the elasticity/scalability aspect of big data streaming frameworks.

As shown in Table 3, our work differs from the previous studies, which focused on the classification of resource management frameworks on theoretical grounds. In contrast to the earlier approaches, we classify and categorize big data resource management frameworks on empirical grounds, derived from multiple evaluation/experimentation studies. Furthermore, our evaluation/ranking methodology is based on a comprehensive list of study variables which were not addressed in the studies conducted earlier.

6. Conclusions and Future Work

There are a number of disruptive and transformative big data technologies and solutions that are rapidly emanating and evolving in order to provide data-driven insight and innovation. The primary objective of this study was to classify popular big data resource management systems. The study was also aimed at addressing the selection of a candidate resource provider based on specific big data application requirements. We surveyed different big data resource management frameworks and investigated the advantages and disadvantages of each. We carried out a performance evaluation of the resource management engines based on seven key factors, and each of the frameworks was ranked based on the empirical evidence from the literature.

6.1. Observations and Findings. Some key findings of the study are as follows:

(i) In terms of processing speed, Apache Flink outperforms the other resource management frameworks for small, medium, and large datasets [30, 36]. However, during our own set of experiments on an Amazon EC2

Scientific Programming 13

Table 3: Comparison and application areas of related research studies.

| Study reference | Data model | Resource frameworks | Study features | Evaluation/ranking methodology |
|---|---|---|---|---|
| [8] | Data stream processing systems | Storm, Flink, Spark, Samza | A brief comparison of resource frameworks | N |
| [9] | Batch and stream processing systems | Horizontal scaling systems such as peer-to-peer, MapReduce, MPI, and Spark, and vertical scaling systems such as CUDA and HDL | Comparison of horizontal and vertical scaling systems | Theoretical comparison of resource frameworks |
| [10] | Batch and stream processing engines | MapReduce, Spark, Flink, and Storm, as well as machine learning libraries | Machine learning libraries and their evaluation mechanism | Performance comparison with respect to machine learning toolkits |
| [11] | Batch and stream processing frameworks | Hadoop, Storm, and other big data frameworks | In-depth analysis of big data opportunities and challenges | N |
| [12] | Batch and stream processing frameworks | Hadoop, Spark, Storm, Flink, and Tez, as well as SQL, Graph, and the bulk synchronous parallel model | Analysis of current open research challenges in the field of big data and the promising directions for future research | N |
| [13] | Stream processing engines | Apache Storm, S4, Flink, Samza, Spark Streaming, and Twitter Heron | Classification of elasticity metrics for resource allocation strategies that meet the demands of stream processing services | Evaluation of elasticity/scaling metrics for stream processing systems |


[Figure 7 is a goal model relating Hadoop, Spark, Flink, Storm, and Samza to fault tolerance, scalability, data size, performance, security, latency, and machine learning requirements, distinguishing soft (nonfunctional) goals from hard (functional) goals via AND decomposition.]

Figure 7: Categorization of resource engines based on key big data application requirements.

cluster with varied task manager settings (1–4 task managers per node), Flink failed to complete custom smaller-size JVM dataset jobs due to inefficient memory management by the Flink memory manager. We could not find any reported evidence of this particular use case in the relevant literature. It seems that most of the performance evaluation studies employed standard benchmarking test sets where the dataset size was relatively large, and hence this particular use case was not reported in those studies. Further research effort is required to elucidate the underlying factors in this particular case.

(ii) Big data applications usually involve massive amounts of data. Apache Spark supports the necessary strategies for its fault tolerance mechanism, but it has been reported to crash on larger datasets. Even during our experimentations, Apache Spark version 1.6 (selected for compatibility with earlier research) crashed on several occasions when the dataset was larger than 500 GB. Although Spark has been ranked higher in terms of fault tolerance with the increase of data scale in several studies [38, 52], it has limitations in handling larger, typical big data applications, and hence such studies cannot be generalized.

(iii) Spark MLlib and Flink-ML offer a variety of machine learning algorithms and utilities to exploit distributed and scalable big data applications. Spark MLlib outperforms Flink-ML in most of the machine learning use cases [42], except in the case when repeated passes are performed on unchanged data. However, this performance evaluation may be further investigated, as research studies such as [53] reported differently, with Flink outperforming Spark on sufficiently large cases.

(iv) For graph processing algorithms such as PageRank, Flink uses the Gelly library, which provides native closed-loop iteration operators, making it a suitable platform for large-scale graph analytics. Spark, on the other hand, uses the GraphX library, which has a much longer preprocessing time to build graphs and other data structures, making its performance worse as compared to Apache Flink. Apache Flink has been reported to obtain the best results for graph processing datasets (3x–5x in [54] and 2x–3x in [45]) as compared to Spark. However, some studies, such as [55], reported Spark to be 1.7x faster than Flink for large graph processing. Such inconsistent behavior may be further investigated in future research studies.

6.2. Future Work. In the area of big data, there is still a clear gap that requires more effort from the research community to build an in-depth understanding of the performance characteristics of big data resource management frameworks. We consider this study a step towards enlarging our knowledge of the big data world and an effort towards improving the state of the art and achieving the big vision of the big data domain. In earlier studies, a clear ranking could not be established, as the study parameters were mostly limited to a few issues such as throughput, latency, and machine learning. Furthermore, further investigation is required on resource engines such as Apache Samza in comparison with the other frameworks. In addition, research effort needs to be carried out in several areas, such as data organization, platform-specific tools, and technological issues in the big data domain, in order to create next-generation big data infrastructures. The performance evaluation factors might also vary among systems depending on the algorithms used. As future work, we plan to benchmark these popular resource engines for meeting the resource demands and requirements of different scientific applications. Moreover, a scalability analysis could be done. In particular, the performance evaluation of adding dynamic worker nodes and the resulting performance analysis is of particular interest. Additionally, further research can be carried out to evaluate performance aspects with respect to resource competition between jobs (on different resource schedulers such as YARN and Mesos) and the fluctuation of available computing resources. Finally, most of the experiments in earlier studies were performed using standard parameter configurations; however, each resource management framework offers domain-specific tweaks and configuration optimization mechanisms for meeting application-specific requirements. The development of a benchmark suite that aims to find maximum throughput based on configuration optimization would be an interesting direction of future research.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] "Big Data in the Cloud: Converging Technologies (August 11)," https://www.intel.com/content/dam/www/public/emea/de/de/documents/product-briefs/big-data-cloud-technologies-brief.pdf.

[2] Q. He, J. Yan, R. Kowalczyk, H. Jin, and Y. Yang, "Lifetime service level agreement management with autonomous agents for services provision," Information Sciences, vol. 179, no. 15, pp. 2591–2605, 2009.

[3] O. Rana, M. Warnier, T. B. Quillinan, and F. Brazier, "Monitoring and reputation mechanisms for service level agreements," in 5th International Workshop on Grid Economics and Business Models, GECON '08, vol. 5206 of Lecture Notes in Computer Science, pp. 125–139, 2008.

[4] N. Yuhanna and M. Gualtieri, "The Forrester Wave: Big Data Hadoop Cloud Solutions, Q2 2016: Elasticity, Automation, and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud (June 20)," https://ncmedia.azureedge.net/ncmedia/2016/05/The_Forrester_Wave_Big_D.pdf.

[5] R. Arora, "An introduction to big data, high performance computing, high-throughput computing, and Hadoop," in Conquering Big Data with High Performance Computing, pp. 1–12, 2016.

[6] F. Pop, J. Kołodziej, and B. D. Martino, Resource Management for Big Data Platforms: Algorithms, Modelling, and High-Performance Computing Techniques, Springer, 2016.

[7] M. Gribaudo, M. Iacono, and F. Palmieri, "Performance modeling of big data-oriented architectures," in Resource Management for Big Data Platforms, Computer Communications and Networks, pp. 3–34, Springer International Publishing, Cham, 2016.

[8] G. Hesse and M. Lorenz, "Conceptual survey on data stream processing systems," in Proceedings of the 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), pp. 797–802, Melbourne, Australia, December 2015.

[9] D. Singh and C. K. Reddy, "A survey on platforms for big data analytics," Journal of Big Data, vol. 2, Article ID 8, 2014.

[10] S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin, "A survey of open source tools for machine learning with big data in the Hadoop ecosystem," Journal of Big Data, vol. 2, no. 1, Article ID 24, 2015.

[11] C. L. P. Chen and C. Y. Zhang, "Data-intensive applications, challenges, techniques and technologies: a survey on Big Data," Information Sciences, vol. 275, pp. 314–347, 2014.

[12] F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, and S. Sakr, "Big data 2.0 processing systems: taxonomy and open challenges," Journal of Grid Computing, vol. 14, no. 3, pp. 379–405, 2016.

[13] M. D. d. Assuncao, A. d. S. Veith, and R. Buyya, "Distributed data stream processing and edge computing: a survey on resource elasticity and future directions," Journal of Network and Computer Applications, vol. 103, pp. 1–17, 2018.

[14] "Big data working group: big data taxonomy, 2014."

[15] W. Inoubli, S. Aridhi, H. Mezni, M. Maddouri, and E. M. Nguifo, "An experimental survey on big data frameworks," Future Generation Computer Systems, 2018.

[16] H.-K. Kwon and K.-K. Seo, "A fuzzy AHP based multi-criteria decision-making model to select a cloud service," International Journal of Smart Home, vol. 8, no. 3, pp. 175–180, 2014.


[17] A. Rezgui and S. Rezgui, "A stochastic approach for virtual machine placement in volunteer cloud federations," in Proceedings of the 2nd IEEE International Conference on Cloud Engineering, IC2E 2014, pp. 277–282, Boston, MA, USA, March 2014.

[18] M. Salama, A. Zeid, A. Shawish, and X. Jiang, "A novel QoS-based framework for cloud computing service provider selection," International Journal of Cloud Applications and Computing, vol. 4, no. 2, pp. 48–72, 2014.

[19] G. Mateescu, W. Gentzsch, and C. J. Ribbens, "Hybrid computing: where HPC meets grid and cloud computing," Future Generation Computer Systems, vol. 27, no. 5, pp. 440–453, 2011.

[20] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: enabling scalable virtual organizations," International Journal of High Performance Computing Applications, vol. 15, no. 3, pp. 200–222, 2001.

[21] S. Benedict, "Performance issues and performance analysis tools for HPC cloud applications: a survey," Computing, vol. 95, no. 2, pp. 89–108, 2013.

[22] E. C. Inacio and M. A. R. Dantas, "A survey into performance and energy efficiency in HPC, cloud and big data environments," International Journal of Networking and Virtual Organisations, vol. 14, no. 4, pp. 299–318, 2014.

[23] N. Yuhanna and M. Gualtieri, "Elasticity, Automation, and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud, Q2 2016."

[24] S. Ullah, M. D. Awan, and M. S. Khiyal, "A price-performance analysis of EC2, Google Compute and Rackspace cloud providers for scientific computing," Journal of Mathematics and Computer Science, vol. 16, no. 2, pp. 178–192, 2016.

[25] J. Hurwitz, A. Nugent, F. Halper, and M. Kaufman, Big Data for Dummies, John Wiley & Sons, 2013.

[26] S. K. Garg, S. Versteeg, and R. Buyya, "A framework for ranking of cloud computing services," Future Generation Computer Systems, vol. 29, no. 4, pp. 1012–1023, 2013.

[27] D. Wu, S. Sakr, and L. Zhu, "Big data programming models," in Handbook of Big Data Technologies, pp. 31–63, 2017.

[28] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, vol. 1, no. 1, 2016.

[29] A. Li, X. Yang, S. Kandula, and M. Zhang, "CloudCmp: comparing public cloud providers," in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10), pp. 1–14, ACM, Melbourne, Australia, November 2010.

[30] J. Veiga, R. R. Exposito, X. C. Pardo, G. L. Taboada, and J. Touriño, "Performance evaluation of big data frameworks for large-scale data analytics," in Proceedings of the 4th IEEE International Conference on Big Data, Big Data 2016, pp. 424–431, December 2016.

[31] I. Mavridis and H. Karatza, "Log file analysis in cloud with Apache Hadoop and Apache Spark," in Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015), Poland, 2015.

[32] M. Zaharia, M. Chowdhury, and T. Das, "Fast and interactive analytics over Hadoop data with Spark," USENIX Login, vol. 37, no. 4, pp. 45–51, 2012.

[33] S. Vellaipandiyan and P. V. Raja, "Performance evaluation of distributed framework over YARN cluster manager," in Proceedings of the 2016 IEEE International Conference on Computational Intelligence and Computing Research, ICCIC 2016, December 2016.

[34] V. Taran, O. Alienin, S. Stirenko, Y. Gordienko, and A. Rojbi, "Performance evaluation of distributed computing environments with Hadoop and Spark frameworks," in Proceedings of the 2017 IEEE International Young Scientists' Forum on Applied Physics and Engineering (YSF), pp. 80–83, Lviv, Ukraine, October 2017.

[35] S. Gopalani and R. Arora, "Comparing Apache Spark and Map Reduce with performance analysis using K-means," International Journal of Computer Applications, vol. 113, no. 1, pp. 8–11, 2015.

[36] M. Bertoni, S. Ceri, A. Kaitoua, and P. Pinoli, "Evaluating cloud frameworks on genomic applications," in Proceedings of the 3rd IEEE International Conference on Big Data, IEEE Big Data 2015, pp. 193–202, November 2015.

[37] S. Hukerikar, R. A. Ashraf, and C. Engelmann, "Towards new metrics for high-performance computing resilience," in Proceedings of the 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017, pp. 23–30, June 2017.

[38] R. Lu, G. Wu, B. Xie, and J. Hu, "Stream bench: towards benchmarking modern distributed stream computing frameworks," in Proceedings of the 7th IEEE/ACM International Conference on Utility and Cloud Computing, UCC 2014, pp. 69–78, December 2014.

[39] L. Gu and H. Li, "Memory or time: performance evaluation for iterative operation on Hadoop and Spark," in Proceedings of the 15th IEEE International Conference on High Performance Computing and Communications, HPCC 2013, and 11th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, EUC 2013, pp. 721–727, November 2013.

[40] M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte, "A performance comparison of open-source stream processing platforms," in Proceedings of the 59th IEEE Global Communications Conference, GLOBECOM 2016, December 2016.

[41] J. L. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, vol. 31, no. 5, pp. 532–533, 1988.

[42] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, "A comparison on scalability for batch big data processing on Apache Spark and Apache Flink," Big Data Analytics, vol. 2, no. 1, 2017.

[43] P. Jakovits and S. N. Srirama, "Evaluating MapReduce frameworks for iterative scientific computing applications," in Proceedings of the 2014 International Conference on High Performance Computing and Simulation, HPCS 2014, pp. 226–233, July 2014.

[44] C. Boden, A. Spina, T. Rabl, and V. Markl, "Benchmarking data flow systems for scalable machine learning," in Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2017, May 2017.

[45] N. Spangenberg, M. Roth, and B. Franczyk, Evaluating New Approaches of Big Data Analytics Frameworks, vol. 208 of Lecture Notes in Business Information Processing, Springer, Cham, 2015.

[46] J. Shi, Y. Qiu, U. F. Minhas et al., "Clash of the titans: MapReduce vs. Spark for large scale data analytics," in Proceedings of the 41st International Conference on Very Large Data Bases, pp. 2110–2121, Kohala Coast, HI, USA, 2015.

[47] M. Kang and J. Lee, "A comparative analysis of iterative MapReduce systems," in Proceedings of the Sixth International Conference, pp. 61–64, Jeju, Republic of Korea, October 2016.

[48] H. Lee, M. Kang, S.-B. Youn, J.-G. Lee, and Y. Kwon, "An experimental comparison of iterative MapReduce frameworks," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, pp. 2089–2094, October 2016.

[49] S. Chintapalli, D. Dagit, B. Evans et al., "Benchmarking streaming computation engines: Storm, Flink and Spark streaming," in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016, pp. 1789–1792, May 2016.

[50] X. Zhang, C. Liu, S. Nepal, W. Dou, and J. Chen, "Privacy-preserving layer over MapReduce on cloud," in Proceedings of the 2nd International Conference on Cloud and Green Computing, CGC 2012, held jointly with the 2nd International Conference on Social Computing and Its Applications, SCA 2012, pp. 304–310, November 2012.

[51] L. Chung, B. A. Nixon, and E. Yu, "Using nonfunctional requirements to systematically select among alternatives in architectural design," in Proceedings of the 1st International Workshop on Architectures for Software Systems, pp. 31–43, 1995.

[52] S. Qian, G. Wu, J. Huang, and T. Das, "Benchmarking modern distributed streaming platforms," in Proceedings of the 2016 IEEE International Conference on Industrial Technology (ICIT), pp. 592–598, Taipei, Taiwan, March 2016.

[53] F. Dambreville and A. Toumi, "Generic and massively concurrent computation of belief combination rules," in Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, pp. 1–6, Blagoevgrad, Bulgaria, November 2016.

[54] R. W. Techentin, M. W. Markland, R. J. Poole, D. R. H. C. R. Haider, and B. K. Gilbert, "Page rank performance evaluation of cluster computing frameworks on Cray Urika-GX supercomputer," Computer Science and Engineering, vol. 6, pp. 33–38, 2016.

[55] O.-C. Marcu, A. Costan, G. Antoniu, and M. S. Perez-Hernandez, "Spark versus Flink: understanding performance in big data analytics frameworks," in Proceedings of the 2016 IEEE International Conference on Cluster Computing, CLUSTER 2016, pp. 433–442, Taipei, Taiwan, September 2016.


Table 2: Comparative analysis of big data resource frameworks (s = 5; each rating is a star count out of 5, N = not reported).

| Criterion | Hadoop | Spark | Flink | Storm | Samza |
|---|---|---|---|---|---|
| Processing speed | 3 | 4 | 5 | 4 | 4 |
| Fault tolerance | 4 | 2 | 4 | 3 | 4 |
| Scalability | 5 | 4 | 3 | 3 | 3 |
| Machine learning | 2 | 5 | 4 | 3 | 4 |
| Low latency | 2 | 3 | 3 | 4 | 4 |
| Security | 4 | 5 | 4 | 4 | N |
| Dataset size | 5 | 3 | 4 | 3 | 3 |
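One way to put such a scorecard to use when shortlisting a framework is a simple weighted-score ranking. The sketch below is illustrative only: the star counts are taken from Table 2, but the weights are purely hypothetical placeholders that a practitioner would set to match application priorities, and Samza's unreported security score is treated as 0 by assumption.

```python
# Weighted multi-criteria ranking using the star scores from Table 2 (s = 5).
# Column order: speed, fault tolerance, scalability, ML, latency, security, data size.
SCORES = {
    "Hadoop": [3, 4, 5, 2, 2, 4, 5],
    "Spark":  [4, 2, 4, 5, 3, 5, 3],
    "Flink":  [5, 4, 3, 4, 3, 4, 4],
    "Storm":  [4, 3, 3, 3, 4, 4, 3],
    "Samza":  [4, 4, 3, 4, 4, 0, 3],  # security not reported (N); assumed 0 here
}
WEIGHTS = [0.25, 0.2, 0.15, 0.1, 0.1, 0.1, 0.1]  # hypothetical priorities

def rank(scores, weights):
    """Return frameworks sorted by weighted total, best first."""
    totals = {name: sum(w * s for w, s in zip(weights, vals))
              for name, vals in scores.items()}
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

for framework, total in rank(SCORES, WEIGHTS):
    print(f"{framework}: {total:.2f}")
```

With these example weights (which emphasize processing speed and fault tolerance), Flink comes out on top; different priorities would reorder the list, which is precisely why the paper argues the choice is application-specific.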

[Figure: a Storm cluster with a Nimbus master node, the ZooKeeper framework, and worker nodes, each running a supervisor with worker processes, executors, and tasks.]

Figure 5: Architecture of a Storm cluster (source: [17]).

[Figure: Samza architecture, with Kafka topics feeding a Samza job whose containers, hosted in YARN NodeManagers, run the individual tasks.]

Figure 6: Samza architecture (source: [18]).

proceeds with the output from the previous stage and generates an input for the next stage. Although machine learning algorithms are mostly designed in the form of cyclic data flow, Spark, Storm, and Samza represent it as a directed acyclic graph to optimize the execution plan. Flink supports controlled cyclic dependency graphs at runtime to represent machine learning algorithms in a very efficient way.
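The iterative pattern discussed above can be sketched in a few lines of plain Python. The toy PageRank loop below (the graph, damping factor, and iteration count are illustrative, not taken from the paper) shows why native iteration support matters: the static graph is re-traversed on every pass, which an acyclic engine must schedule as a fresh stage each round, while closed-loop iteration operators can keep it resident.

```python
# Minimal PageRank-style iteration illustrating cyclic dataflow: the static
# graph is reused on every pass. Graph and parameters are illustrative only.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {node: 1.0 / len(graph) for node in graph}
damping, iterations = 0.85, 20

for _ in range(iterations):
    incoming = {node: 0.0 for node in graph}
    for node, outlinks in graph.items():      # static data, re-read each pass
        share = rank[node] / len(outlinks)
        for target in outlinks:
            incoming[target] += share
    rank = {node: (1 - damping) / len(graph) + damping * incoming[node]
            for node in graph}

print({node: round(value, 3) for node, value in sorted(rank.items())})
```

Because every node here has outgoing links, the total rank mass stays at 1.0 across iterations; a distributed engine runs the same fixed-point loop, only with the per-pass work sharded across the cluster.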

Hadoop and Storm do not provide any default interactive environment. Apache Spark has a command-line interactive shell to use the application features. Flink provides a Scala shell to configure standalone as well as cluster setups. Apache Hadoop is highly scalable, and it has been used in Yahoo production consisting of 42,000 nodes in 20 YARN clusters. The largest known cluster size for Spark is 8000 computing nodes, while Storm has been tested on a maximum of 300-node clusters. An Apache Samza cluster with around a hundred nodes has been used in the LinkedIn data flow and application messaging system. Apache Flink has been customized for the Alibaba search engine with a deployment capacity of thousands of processing nodes.

3. Comparative Evaluation of Big Data Frameworks

Big data in cloud computing, a popular research trend, is posing significant influence on current enterprises, IT industries, and research communities. There are a number of disruptive and transformative big data technologies and solutions that are rapidly emerging and evolving in order to provide data-driven insight and innovation. Furthermore, modern cloud computing services are offering all kinds of big data analytic tools, technologies, and computing infrastructures to speed up the data analysis process at an affordable cost. Although many distributed resource management frameworks are available nowadays, the main issue is how to


select a suitable big data framework. The selection of one big data platform over the others will come down to the specific application requirements and constraints that may involve several tradeoffs and application usage scenarios. However, we can identify some key factors that need to be fulfilled before deploying a big data application in the cloud. In this section, based on some empirical evidence from the available literature, we discuss the advantages and disadvantages of each resource management framework.

3.1. Processing Speed. Processing speed is an important performance measurement that may be used to evaluate the effectiveness of different resource management frameworks. It is a common metric for the maximum number of I/O operations to disk or memory, or the data transfer rate between the computational units of the cluster, over a specific amount of time. In the context of big data, the average processing speed, represented as m and calculated after n iteration runs, is the maximum amount of memory/disk-intensive operations that can be performed over a time interval t_i:

\[ m = \frac{\sum_{i=1}^{n} m_i}{\sum_{i=1}^{n} t_i}. \quad (1) \]
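As a quick sanity check of the metric in (1), the helper below computes the average processing speed from per-run operation counts and elapsed times; the sample numbers are made up for illustration.

```python
# Sketch of the average processing speed metric in (1): total operations
# completed across n iteration runs divided by the total elapsed time.
def average_processing_speed(ops_per_run, time_per_run):
    """m = (sum of m_i) / (sum of t_i), in operations per unit time."""
    return sum(ops_per_run) / sum(time_per_run)

ops = [1_200_000, 1_150_000, 1_300_000]   # memory/disk-intensive ops per run
times = [10.2, 9.8, 11.0]                 # seconds per run (illustrative)
print(f"{average_processing_speed(ops, times):,.0f} ops/s")
```

Note that (1) is a ratio of sums, not a mean of per-run ratios, so long runs weigh more heavily than short ones.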

Veiga et al. [30] conducted a series of experiments on a multicore cluster setup to demonstrate performance results of Apache Hadoop, Spark, and Flink. Apache Spark and Flink proved to be much more efficient execution platforms than Hadoop when performing nonsort benchmarks. It was further noted that Spark showed better performance results for operations such as WordCount and K-Means (CPU-bound in nature), while Flink achieved better results for the PageRank algorithm (memory-bound in nature). Mavridis and Karatza [31] experimentally compared performance statistics of Apache Hadoop and Spark on the Okeanos IaaS cloud platform. For each set of experiments, necessary statistics related to execution time, working nodes, and the dataset size were recorded. Spark performance was found optimal as compared to Hadoop for most of the cases. Furthermore, Spark on the YARN platform showed suboptimal results as compared to the case when it was executed in standalone mode. Some similar results were also observed by Zaharia et al. [32] on a 100 GB dataset record. Vellaipandiyan and Raja [33] demonstrated performance evaluation and comparison of Hadoop and Spark frameworks on a residents' record dataset ranging from 100 GB to 900 GB in size. Spark's scale of performance was relatively better when the dataset size was between small and medium size (100 GB–750 GB); afterwards, its performance declined as compared to Hadoop. The primary reason for the performance decline was evident, as the Spark cache could not fit into memory for the larger dataset. Taran et al. [34] quantified performance differences of Hadoop and Spark using a WordCount dataset ranging from 100 KB to 1 GB. It was observed that the Hadoop framework was five times faster than Spark when the evaluation was performed using a larger set of data sources, whereas for smaller tasks Spark showed better performance results. However, the speed-up ratio decreased for both frameworks with the growth of the input dataset.

Gopalani and Arora [35] used the K-Means algorithm on some small, medium, and large location datasets to compare Hadoop and Spark frameworks. The study results showed that Spark performed up to three times better than MapReduce for most of the cases. Bertoni et al. [36] performed an experimental evaluation of Apache Flink and Storm using a large genomic dataset on the Amazon EC2 cloud. Apache Flink was superior to Storm when performing histogram and map operations, while Storm outperformed Flink when the genomic join application was deployed.

3.2. Fault Tolerance. Fault tolerance is the characteristic that enables a system to continue functioning in case of the failure of one or more components. High-performance computing applications involve hundreds of nodes that are interconnected to perform a specific task; failing a node should have zero or minimal effect on overall computation. The tolerance of a system, represented as Tol_FT, to meet its requirements after a disruption is the ratio of the time to complete tasks without observing any fault events to the overall execution time where some fault events were detected and the system state is reverted back to a consistent state:

\[ \mathrm{Tol}_{\mathrm{FT}} = \frac{T_x}{T_x + \sigma^2}, \quad (2) \]

where T_x is the estimated correct execution time, obtained from a program run that is presumed to be fault-free or by averaging the execution time from several application runs that produce a known correct output, and \sigma^2 represents the variance in a program's execution time due to the occurrence of fault events. For an HPC application that consists of a set of computationally intensive tasks \Gamma = \{\tau_1, \tau_2, \ldots, \tau_n\}, and since Tol_FT for each individual task is computed as Tol_{\tau_i}, the overall application resilience Tol may be calculated as [37]:

\[ \mathrm{Tol} = \sqrt[n]{\mathrm{Tol}_{\tau_1} \cdot \mathrm{Tol}_{\tau_2} \cdots \mathrm{Tol}_{\tau_n}}. \quad (3) \]
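The two resilience formulas compose naturally: (2) gives a per-task tolerance in (0, 1], and (3) aggregates them by geometric mean. A small sketch (the timing and variance values are invented for illustration):

```python
import math

# Sketch of the resilience metrics in (2) and (3): per-task tolerance
# Tol_FT = T_x / (T_x + sigma^2), aggregated over n tasks by geometric mean.
def task_tolerance(fault_free_time, variance):
    """Per-task tolerance per (2); equals 1.0 when execution time never varies."""
    return fault_free_time / (fault_free_time + variance)

def application_resilience(tolerances):
    """Geometric mean of per-task tolerances, per (3)."""
    n = len(tolerances)
    return math.prod(tolerances) ** (1.0 / n)

# Illustrative (T_x seconds, sigma^2) pairs for three tasks:
tols = [task_tolerance(t, v) for t, v in [(120, 4.0), (95, 2.5), (210, 9.0)]]
print(round(application_resilience(tols), 4))
```

Using the geometric rather than arithmetic mean means a single badly fault-affected task drags the whole application's resilience score down, which matches the intuition that an HPC job is only as resilient as its weakest task.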

Lu et al. [38] used the StreamBench toolkit to evaluate the performance and fault tolerance ability of Apache Spark, Storm, and Samza. It was found that with increased data size, Spark is much more stable and fault-tolerant as compared to Storm, but may be less efficient when compared to Samza. Furthermore, when compared in terms of handling large capacities of data, both Samza and Spark outperformed Storm. Gu and Li [39] used the PageRank algorithm to perform a comparative experiment on Hadoop and Spark frameworks. It was observed that for smaller datasets, such as wiki-Vote and soc-Slashdot0902, Spark outperformed Hadoop by a wide margin. However, this speed-up degraded with the growth of the dataset, and for large datasets Hadoop easily outperformed Spark. Furthermore, for massively large datasets, Spark was reported to crash with a JVM heap exception, while Hadoop still performed its task. Lopez et al. [40] evaluated the throughput and fault tolerance mechanisms of Apache Storm and Flink. The experiments were based on a threat detection system, where Apache Storm demonstrated better throughput as compared to Flink. For fault tolerance,


different virtual machines were manually turned off to analyze the impact of node failures. Apache Flink used its internal subsystem to detect and migrate the failed tasks to other machines and hence resulted in very few message losses. On the other hand, Storm took more time, as ZooKeeper, involving some performance overhead, was responsible for reporting the state of Nimbus and thereafter processing the failed task on other nodes.

3.3. Scalability. Scalability refers to the ability to accommodate large loads or changes in size/workload by provisioning resources at runtime. This can be further categorized as scale-up (by making hardware stronger) or scale-out (by adding additional nodes). One of the critical requirements of enterprises is to process large volumes of data in a timely manner to address high-value business problems. Dynamic resource scalability allows business entities to perform massive computation in parallel, thus reducing overall time complexity and effort. The definition of scalability comes from Amdahl's and Gustafson's laws [41]. Let W be the size of the workload before the improvement of the system resources; the fraction of the execution workload that does not benefit from the improvement of system resources is \alpha, and the fraction that does benefit is 1 - \alpha. When using an n-processor system, the user workload is scaled to:

\[ W' = \alpha W + (1 - \alpha) n W. \quad (4) \]

The parallel execution time of a scaled workload on n processors is defined as the scaled-workload speed-up S, as shown in (5):

\[ S = \frac{W'}{W} = \frac{\alpha W + (1 - \alpha) n W}{W} = \alpha + (1 - \alpha) n. \quad (5) \]
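Equation (5) is Gustafson's scaled speedup, and it is worth seeing how quickly it grows with n even for a nonzero serial fraction; the helper below evaluates it (the 5% serial fraction is an arbitrary example value).

```python
# Scaled speedup per Gustafson's law, following (4)-(5): with serial
# fraction alpha and n processors, S = alpha + (1 - alpha) * n.
def scaled_speedup(alpha, n):
    """Gustafson scaled speedup for serial fraction alpha on n processors."""
    return alpha + (1 - alpha) * n

for n in (1, 8, 64):
    print(f"n={n:3d}  S={scaled_speedup(0.05, n):.2f}")  # alpha = 0.05 (example)
```

Unlike Amdahl's fixed-size bound, the speedup here grows nearly linearly in n because the workload is assumed to grow with the cluster, which is the typical big data scenario this section analyzes.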

Garcıa-Gil et al [42] performed scalability and performancecomparison of Apache Spark and Flink using feature selec-tion framework to assemble multiple information theoreticcriteria into a single greedy algorithm The ECBDL14 datasetwas used to measure scalability factor for the frameworksIt was observed that Spark scalability performance was 4ndash10times faster than Flink Jakovits and Srirama [43] analyzedfour MapReduce based frameworks including Hadoop andSpark for benchmarking partitioning aroundMedoids Clus-tering Large Applications and Conjugate Gradient linearsystem solver algorithms using MPI All experiments wereperformed on 33 Amazon EC2 large instances cloud Forall algorithms Spark performed much better as comparedto Hadoop in terms of both performance and scalabilityBoden et al [44] aimed to investigate the scalability withrespect to both data size and dimensionality in order todemonstrate a fair and insightful benchmark that reflects therequirements of real-world machine learning applicationsThe benchmark was comprised of distributed optimizationalgorithms for supervised learning as well as algorithms forunsupervised learning For supervised learning they imple-mented machine learning algorithms by using Breeze librarywhile for unsupervised learning they chose the popular 119896-Means clustering algorithm in order to assess the scalability

of the two resource management frameworks. The overall execution time of Flink was relatively low in resource-constrained settings with a limited number of nodes, while Spark had a clear edge once enough main memory was available through the addition of new computing nodes.

3.4. Machine Learning and Iterative Tasks Support. Big data applications are inherently complex and usually involve tasks and algorithms that are iterative in nature. These applications have a distinctly cyclic character: a set of tasks is repeated continually until the result can no longer be substantially improved.
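This repeat-until-no-improvement pattern can be sketched in a few lines of plain Python (a toy illustration only; no framework API is involved):

```python
import math

def iterate_until_converged(step, state, tol=1e-9, max_iter=1000):
    # Repeatedly apply `step` until two successive results differ by less
    # than `tol`: the generic shape of iterative big data algorithms such
    # as k-Means or PageRank
    for i in range(1, max_iter + 1):
        new_state = step(state)
        if abs(new_state - state) < tol:
            return new_state, i
        state = new_state
    return state, max_iter

# Example: the fixed point of cos(x), found by pure repetition
root, iterations = iterate_until_converged(math.cos, 1.0)
```

In a distributed setting, each pass through such a loop is a full cluster job, which is why per-iteration overhead dominates the comparisons below.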

Spangenberg et al. [45] used real-world datasets with four algorithms, that is, WordCount, K-Means, PageRank, and a relational query, to benchmark Apache Flink and Storm. It was observed that Apache Storm performs better in batch mode as compared to Flink. However, with increasing complexity, Apache Flink had a performance advantage over Storm, and thus it was better suited for iterative data or graph processing. Shi et al. [46] focused on analyzing Apache MapReduce and Spark for batch and iterative jobs. For smaller datasets Spark proved to be the better choice, but when experiments were performed on larger datasets, MapReduce turned out to be several times faster than Spark. For iterative operations such as K-Means, Spark turned out to be 1.5 times faster than MapReduce in the first iteration, and more than 5 times faster in subsequent operations.

Kang and Lee [47] examined five resource management frameworks, including Apache Hadoop and Spark, with respect to the performance overheads (disk input/output, network communication, scheduling, etc.) incurred in supporting iterative computation. The PageRank algorithm was used to evaluate these performance issues. Since static data processing tends to be a more frequent operation than dynamic data processing, and static data is reloaded in every iteration of MapReduce, it may cause significant performance overhead in the case of MapReduce. Apache Spark, on the other hand, uses read-only cached versions of objects (resilient distributed datasets) which can be reused in parallel operations, thus reducing the performance overhead during iterative computation. Lee et al. [48] evaluated five systems, including Hadoop and Spark, over various workloads representing four iterative algorithms. The experimentation was performed on the Amazon EC2 cloud. Overall, Spark showed the best performance when iterative operations were performed in main memory. In contrast, the performance of Hadoop was significantly poorer than that of the other resource management systems.
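The effect reported here can be mimicked with a toy, single-machine model (plain Python, not the frameworks' real APIs): when the static link data is "cached" it is loaded once, whereas the MapReduce-style variant reloads it in every iteration.

```python
# Toy model of why caching read-only data helps iterative jobs: the static
# link structure is loaded once up front, or once per iteration otherwise.

def make_loader(counter):
    def load_links():                    # stands in for an expensive disk read
        counter["loads"] += 1
        return {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
    return load_links

def pagerank(load_links, iterations, cached):
    links = load_links() if cached else None    # "cache" the static data once
    ranks = {"a": 1.0, "b": 1.0, "c": 1.0}
    for _ in range(iterations):
        data = links if cached else load_links()  # reload every iteration
        contribs = {p: 0.0 for p in ranks}
        for page, outs in data.items():
            for out in outs:
                contribs[out] += ranks[page] / len(outs)
        ranks = {p: 0.15 + 0.85 * c for p, c in contribs.items()}
    return ranks

uncached, cached = {"loads": 0}, {"loads": 0}
pagerank(make_loader(uncached), 10, cached=False)  # 10 loads
pagerank(make_loader(cached), 10, cached=True)     # 1 load
```

Ten iterations trigger ten loads without caching but only one with it; in a real cluster each avoided load is a full pass over disk or network.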

3.5. Latency. Big data and low latency are strongly linked. Big data applications provide true value to businesses, but they are mostly time critical. If cloud computing is to be a successful platform for big data implementation, one of the key requirements will be the provisioning of high-speed networks to reduce communication latency. Furthermore, big data frameworks usually involve a centralized design in which the scheduler assigns all tasks through a single node, which may significantly impact latency when the size of the data is huge.

Scientific Programming 11

Let T_elapsed be the elapsed time between the start and finish time of a program in a distributed architecture, let T_i be the effective execution time, and let λ_i be the sum of the total idle units of the i-th processor from a set of N processors. Then the average latency, represented as λ(W, N) for a workload of size W, is defined as the average amount of overhead time needed for each processor to complete the task:

λ(W, N) = (1/N) Σ_{i=1}^{N} (T_elapsed − T_i + λ_i). (6)
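A direct transcription of (6) into code (a sketch; the function and argument names are ours):

```python
def average_latency(t_elapsed, exec_times, idle_units):
    # Eq. (6): mean over the N processors of (T_elapsed - T_i + lambda_i)
    n = len(exec_times)
    return sum(t_elapsed - t_i + idle
               for t_i, idle in zip(exec_times, idle_units)) / n
```

For two processors finishing after 8 and 9 time units of a 10-unit run, with 1 and 0 idle units respectively, the average overhead is (3 + 1) / 2 = 2 time units.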

Chintapalli et al. [49] conducted a detailed analysis of the Apache Storm, Flink, and Spark streaming engines for latency and throughput. The study results indicated that, at high throughput, Flink and Storm have significantly lower latency than Spark. However, Spark was able to handle higher throughput than the other streaming engines. Lu et al. [38] proposed the StreamBench benchmark framework to evaluate modern distributed stream processing frameworks. The framework covers dataset selection, data generation methodologies, program set description, workload suite design, and metric proposition. Two real-world datasets, AOL Search Data and the CAIDA Anonymized Internet Traces Dataset, were used to assess the performance, scalability, and fault tolerance aspects of streaming frameworks. It was observed that Storm's latency was in most cases far less than Spark's, except when the scales of the workload and dataset were massive in nature.

3.6. Security. Instead of the classical HPC environment, where information is stored in-house, many big data applications are now increasingly deployed on the cloud, where privacy-sensitive information may be accessed or recorded by different data users with ease. Although data privacy and security issues are not a new topic in the area of distributed computing, their importance is amplified by the wide adoption of cloud computing services for big data platforms. The dataset may be exposed to multiple users for different purposes, which may lead to security and privacy risks.

Let N be a list of security categories that may be provided by a security mechanism. For instance, a framework may use an encryption mechanism to provide data security and an access control list for authentication and authorization services. Let max(W_i) be the maximum weight that is assigned to the i-th security category from the list of N categories, and let W_i be the reputation score of a particular resource management framework. Then the framework security ranking score can be represented as

Sec_score = (Σ_{i=1}^{N} W_i) / (Σ_{i=1}^{N} max(W_i)). (7)
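Equation (7) reduces to a ratio of sums; a minimal sketch (names are ours):

```python
def security_score(weights, max_weights):
    # Eq. (7): reputation scores earned over the maximum assignable weights
    return sum(weights) / sum(max_weights)
```

A framework scoring 2 of 3 on encryption and 1 of 2 on access control would rate 3/5 = 0.6.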

Hadoop and Storm use the Kerberos authentication protocol for computing nodes to prove their identity [50]. Spark adopts a password-based shared secret configuration as well as Access Control Lists (ACLs) to control its authentication and authorization mechanisms. In Flink, stream brokers are responsible for providing the authentication mechanism across multiple services. Apache Samza/Kafka provides no built-in security at the system level.
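As an illustration, the Spark mechanisms just mentioned map onto a handful of configuration keys (the key names follow the Spark configuration documentation; the values below are placeholders, not recommendations):

```
spark.authenticate         true
spark.authenticate.secret  <shared-secret>
spark.acls.enable          true
spark.ui.view.acls         alice,bob
spark.modify.acls          alice
```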

3.7. Dataset Size Support. Many scientific applications scale up to hundreds of nodes to process massive amounts of data that may exceed hundreds of terabytes. Unless big data applications are properly optimized for larger datasets, performance may degrade as the data grows; in many cases this may even result in a crash of the software, with a resulting loss of time and money. We used the same methodology to collect big data support statistics as presented in the earlier sections.

4. Discussion on Big Data Frameworks

Every big data framework has evolved from its unique design characteristics and application-specific requirements. Based on the key factors and the empirical evidence used to evaluate big data frameworks, as discussed in Section 3, we summarize the results of the seven evaluation factors in the form of a star ranking, Ranking_RF ∈ [0, s], where s represents the maximum evaluation score. The general equation to produce the ranking for each resource framework (RF) is given as

Ranking_RF = ‖ (s / Σ_{i=1}^{7} Σ_{j=1}^{N} max(W_ij)) · Σ_{i=1}^{7} Σ_{j=1}^{N} W_ij ‖, (8)

where N is the total number of research studies for a particular set of evaluation metrics, max(W_ij) ∈ [0, 1] is the relative normalized weight assigned to each literature study based on the number of experiments performed, and W_ij is the framework test-bed score calculated from the experimentation results of each study.
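Putting (8) into code (a sketch with names of our own choosing; each row holds the per-study weights for one of the seven evaluation factors):

```python
def framework_ranking(s, scores, max_weights):
    # Eq. (8): summed test-bed scores W_ij, normalized by the summed
    # per-study maxima max(W_ij) and scaled to the maximum score s
    total = sum(sum(row) for row in scores)
    norm = sum(sum(row) for row in max_weights)
    return s * total / norm
```

With a five-star scale (s = 5), one factor, and two studies scoring 0.5 and 1.0 against maxima of 1.0 each, the framework earns 5 · 1.5 / 2 = 3.75 stars.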

Hadoop MapReduce has a clear edge in large-scale deployment and larger dataset processing. Hadoop is highly compatible and interoperable with other frameworks. It also offers a reliable fault tolerance mechanism that can provide failure-free operation over a long period of time, and it can operate on a low-cost configuration. However, Hadoop is not suitable for real-time applications. It has a significant disadvantage when latency, throughput, and iterative job support for machine learning are the key considerations of the application requirements.

Apache Spark is designed to be a replacement for the batch-oriented Hadoop ecosystem and to run over both static and real-time datasets. It is highly suitable for high-throughput streaming applications where latency is not a major issue. Spark is memory intensive, and all operations take place in memory. As a result, it may crash if not enough memory is available for further operations (before the release of Spark version 1.5 it was not capable of handling datasets larger than the size of RAM, and the problem of handling larger datasets still persists in the newer releases, with different performance overheads). A few research efforts, such as Project Tungsten, are aimed at improving the memory and CPU efficiency of Spark applications. Spark also lacks its own storage system, so its integration with HDFS through YARN, or with Cassandra using Mesos, is an extra overhead for cluster configuration.
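For context, the memory behavior discussed above is tuned through settings along these lines in spark-defaults.conf (key names follow the Spark configuration documentation for Spark 1.6 and later; the values are illustrative only, not recommendations):

```
spark.executor.memory         8g
spark.memory.fraction         0.6
spark.memory.storageFraction  0.5
spark.serializer              org.apache.spark.serializer.KryoSerializer
```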

Apache Flink is a true streaming engine. Flink supports both batch and real-time operations over a common runtime to fulfill the requirements of the Lambda architecture. However, it may also work in batch mode by stopping the streaming source. Like Spark, Flink performs all operations in memory, but under memory pressure it may also use disk storage to avoid application failure. Flink has some major advantages over Hadoop and Spark, providing better support for iterative processing with high throughput while keeping latency low.

Apache Storm was designed to provide a scalable, fault-tolerant real-time streaming engine for data analysis, much as Hadoop did for batch processing. However, the empirical evidence suggests that Apache Storm has proved inefficient at meeting the scale-up/scale-down requirements of real-time big data applications. Furthermore, since it uses micro-batch stream processing, it is not very efficient where continuous stream processing is a major concern, nor does it provide a mechanism for simple batch processing. For fault tolerance, Storm uses ZooKeeper to store the state of the processes, which may involve some extra overhead and may also result in message loss. On the other hand, Storm is an ideal solution for near-real-time application processing, where the workload can be processed with minimal delay under strict latency requirements.

Apache Samza, in integration with Kafka, provides some unique features that are not offered by other stream processing engines. Samza provides a powerful checkpointing-based fault tolerance mechanism with minimal data loss. Samza jobs can achieve high throughput with low latency when integrated with Kafka. However, Samza lacks some important features as a data processing engine, and it offers no built-in security mechanism for data access control.

To categorize the selection of the best resource engine based on a particular set of requirements, we use the framework proposed by Chung et al. [51]. The framework provides a layout for matching, ranking, and selecting a system based on particular requirements. The matching criterion is based on user goals, which are categorized as soft and hard goals. Soft goals represent the nonfunctional requirements of the system (such as security, fault tolerance, and scalability), while hard goals represent the functional aspects of the system (such as machine learning and data size support). The relationship between the system components and the goals can be ranked as very positive (++), positive (+), negative (−), and very negative (−−). Based on the evidence provided by the literature, the categorization of the major resource management frameworks is presented in Figure 7.

5. Related Work

Our research work differs from other efforts in that the subject, goal, and object of study are not identical: we provide an in-depth comparison of popular resource engines based on empirical evidence from the existing literature. Hesse and Lorenz [8] conducted a conceptual survey on stream processing systems. However, their discussion focused on some basic differences related to real-time data processing engines. Singh and Reddy [9] provided a thorough analysis of big data analytics platforms that included peer-to-peer networks, field programmable gate arrays (FPGA), the Apache Hadoop ecosystem, high-performance computing (HPC) clusters, multicore CPUs, and graphics processing units (GPU). Our case is different, as we are particularly interested in big data processing engines. Finally, Landset et al. [10] focused on machine learning libraries and their evaluation based on ease of use, scalability, and extensibility. However, our work differs from theirs, as the primary focuses of the two studies are not identical.

Chen and Zhang [11] discussed big data problems and challenges and the associated techniques and technologies for addressing them. Several potential techniques, including cloud computing, quantum computing, granular computing, and biological computing, were investigated, and the possible opportunities to explore these domains were demonstrated. However, the performance evaluation was discussed only on theoretical grounds. A taxonomy and detailed analysis of the state of the art in big data 2.0 processing systems were presented in [12]. The focus of the study was to identify current research challenges and highlight opportunities for new innovations and optimization for future research and development. Assuncao et al. [13] reviewed multiple generations of data stream processing frameworks that provide mechanisms for resource elasticity to match the demands of stream processing services. The study examined the challenges associated with efficient resource management decisions and suggested solutions derived from the existing research studies. However, the study metrics are restricted to the elasticity/scalability aspect of big data streaming frameworks.

As shown in Table 3, our work differs from the previous studies, which focused on the classification of resource management frameworks on theoretical grounds. In contrast to the earlier approaches, we classify and categorize big data resource management frameworks on empirical grounds, derived from multiple evaluation/experimentation studies. Furthermore, our evaluation/ranking methodology is based on a comprehensive list of study variables which were not addressed in the studies conducted earlier.

6. Conclusions and Future Work

There are a number of disruptive and transformative big data technologies and solutions that are rapidly emerging and evolving in order to provide data-driven insight and innovation. The primary objective of this study was to classify popular big data resource management systems. The study also aimed to address the selection of a candidate resource provider based on specific big data application requirements. We surveyed different big data resource management frameworks and investigated the advantages and disadvantages of each. We carried out a performance evaluation of the resource management engines based on seven key factors, and each framework was ranked based on the empirical evidence from the literature.

6.1. Observations and Findings. Some key findings of the study are as follows:

(i) In terms of processing speed, Apache Flink outperforms the other resource management frameworks for small, medium, and large datasets [30, 36]. However, during our own set of experiments on an Amazon EC2

Table 3: Comparison and application areas of related research studies.

Study reference | Data model | Resource frameworks | Study features | Evaluation/ranking methodology
[8] | Data stream processing systems | Storm, Flink, Spark, Samza | A brief comparison of resource frameworks | None
[9] | Batch and stream processing systems | Horizontal scaling systems such as peer-to-peer, MapReduce, MPI, and Spark, and vertical scaling systems such as CUDA and HDL | Comparison of horizontal and vertical scaling systems | Theoretical comparison of resource frameworks
[10] | Batch and stream processing engines | MapReduce, Spark, Flink, and Storm, as well as machine learning libraries | Machine learning libraries and their evaluation mechanism | Performance comparison with respect to machine learning toolkits
[11] | Batch and stream processing frameworks | Hadoop, Storm, and other big data frameworks | In-depth analysis of big data opportunities and challenges | None
[12] | Batch and stream processing frameworks | Hadoop, Spark, Storm, Flink, and Tez, as well as SQL, graph, and bulk synchronous parallel models | Analysis of current open research challenges in the field of big data and the promising directions for future research | None
[13] | Stream processing engines | Apache Storm, S4, Flink, Samza, Spark Streaming, and Twitter Heron | Classification of elasticity metrics for resource allocation strategies that meet the demands of stream processing services | Evaluation of elasticity scaling metrics for stream processing systems

[Figure 7 (described): requirements are decomposed into soft (nonfunctional) goals, namely fault tolerance (recoverability), scalability (scalable), performance (high/medium), security (access control, authentication), and latency (low), and hard (functional) goals, namely machine learning (iterative) and data size (large); Hadoop, Spark, Flink, Storm, and Samza are rated against these goals on the ++ to −− scale.]

Figure 7: Categorization of resource engines based on key big data application requirements.

cluster with varied task manager settings (1–4 task managers per node), Flink failed to complete custom smaller-size JVM dataset jobs due to the inefficient memory management of the Flink memory manager. We could not find any reported evidence of this particular use case in the relevant literature. It seems that most performance evaluation studies employed standard benchmarking test sets where the dataset size was relatively large, and hence this particular use case was not reported. Further research effort is required to elucidate the underlying factors in this particular case.

(ii) Big data applications usually involve massive amounts of data. Apache Spark supports the necessary strategies for its fault tolerance mechanism, but it has been reported to crash on larger datasets. Even during our experimentations, Apache Spark version 1.6 (selected for compatibility with earlier research) crashed on several occasions when the dataset was larger than 500 GB. Although Spark has been ranked higher in terms of fault tolerance with increasing data scale in several studies [38, 52], it has limitations in handling datasets typical of big data applications, and hence such studies cannot be generalized.

(iii) Spark MLlib and Flink-ML offer a variety of machine learning algorithms and utilities for building distributed and scalable big data applications. Spark MLlib outperforms Flink-ML in most machine learning use cases [42], except when repeated passes are performed on unchanged data. However, this performance evaluation may be investigated further, as research studies such as [53] reported differently, with Flink outperforming Spark on sufficiently large cases.

(iv) For graph processing algorithms such as PageRank, Flink uses the Gelly library, which provides native closed-loop iteration operators, making it a suitable platform for large-scale graph analytics. Spark, on the other hand, uses the GraphX library, which has a much longer preprocessing time to build the graph and other data structures, making its performance worse compared to Apache Flink. Apache Flink has been reported to obtain the best results for graph processing datasets (3x–5x in [54] and 2x–3x in [45]) as compared to Spark. However, some studies, such as [55], reported Spark to be 1.7x faster than Flink for large graph processing. Such inconsistent behavior may be investigated further in future research studies.

6.2. Future Work. In the area of big data, there is still a clear gap that requires more effort from the research community to build an in-depth understanding of the performance characteristics of big data resource management frameworks. We consider this study a step towards enlarging our knowledge of the big data world and an effort towards improving the state of the art and achieving the big vision of the big data domain. In earlier studies, a clear ranking could not be established, as the study parameters were mostly limited to a few issues such as throughput, latency, and machine learning. Further investigation is also required of resource engines such as Apache Samza in comparison with other frameworks. In addition, research effort needs to be carried out in several areas, such as data organization, platform-specific tools, and technological issues in the big data domain, in order to create next-generation big data infrastructures. The performance evaluation factors might also vary among systems depending on the algorithms used. As future work, we plan to benchmark these popular resource engines for meeting the resource demands and requirements of different scientific applications. Moreover, a scalability analysis could be done; in particular, the performance evaluation of adding dynamic worker nodes and the resulting performance analysis is of particular interest. Additionally, further research can be carried out to evaluate performance aspects with respect to resource competition between jobs (on different resource schedulers such as YARN and Mesos) and the fluctuation of available computing resources. Finally, most of the experimentations in earlier studies were performed using standard parameter configurations; however, each resource management framework offers domain-specific tweaks and configuration optimization mechanisms for meeting application-specific requirements. The development of a benchmark suite that aims to find maximum throughput based on configuration optimization would be an interesting direction of future research.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] ldquoBig Data in the Cloud Converging Technologies (August 11)rdquohttpswwwintelcomcontentdamwwwpublicemeadededocumentsproduct-briefsbig-data-cloud-technologies-briefpdf

[2] Q He J Yan R Kowalczyk H Jin and Y Yang ldquoLifetimeservice level agreement management with autonomous agentsfor services provisionrdquo Information Sciences vol 179 no 15 pp2591ndash2605 2009

[3] O Rana M Warnier T B Quillinan and F Brazier ldquoMonitor-ing and reputation mechanisms for service level agreementsrdquoin 5th International Workshop on Grid Economics and BusinessModels GECON rsquo08 vol 5206 of Lecture Notes in ComputerScience (including subseries LectureNotes inArtificial Intelligenceand Lecture Notes in Bioinformatics) Preface pp 125ndash139 2008

[4] N Yuhanna and M Gualtieri ldquoThe Forrester Wave Big DataHadoop Cloud Solutions Q2 2016 Elasticity Automationand Pay-As-You-Go Compel Enterprise Adoption of Hadoopin theCloud (June 20)rdquo httpsncmediaazureedgenetncmedia201605The Forrester Wave Big Dpdf

[5] R Arora ldquoAn introduction to big data high performance com-puting high-throughput computing and Hadooprdquo ConqueringBig Data with High Performance Computing pp 1ndash12 2016

[6] F Pop J Kołodziej and B D Martino Resource Managementfor Big Data Platforms Algorithms Modelling and High- Perfor-mance Computing Techniques Springer 2016

[7] M Gribaudo M Iacono and F Palmieri ldquoPerformance mod-eling of big data-oriented architecturesrdquo in Resource Manage-ment for Big Data Platforms Computer Communications andNetworks pp 3ndash34 Springer International Publishing Cham2016

[8] G Hesse and M Lorenz ldquoConceptual survey on data streamprocessing systemsrdquo inProceedings of the 2015 IEEE 21st Interna-tional Conference on Parallel and Distributed Systems (ICPADS)pp 797ndash802 Melbourne Australia December 2015

[9] D Singh and C K Reddy ldquoA survey on platforms for big dataanalyticsrdquo Journal of Big Data vol 2 Article ID 8 2014

[10] S Landset T M Khoshgoftaar A N Richter and T HasaninldquoA survey of open source tools for machine learning with bigdata in the Hadoop ecosystemrdquo Journal of Big Data vol 2 no1 Article ID 24 2015

[11] C L P Chen and C Y Zhang ldquoData-intensive applicationschallenges techniques and technologies a survey on Big DatardquoInformation Sciences vol 275 pp 314ndash347 2014

[12] F Bajaber R Elshawi O Batarfi A Altalhi A Barnawi andS Sakr ldquoBig data 20 processing systems taxonomy and openchallengesrdquo Journal of Grid Computing vol 14 no 3 pp 379ndash405 2016

[13] M D d Assuncao A d S Veith and R Buyya ldquoDistributeddata stream processing and edge computing a survey onresource elasticity and future directionsrdquo Journal ofNetwork andComputer Applications vol 103 pp 1ndash17 2018

[14] ldquoBig data working group big data taxonomy 2014rdquo[15] W Inoubli S Aridhi H Mezni M Maddouri and E M

Nguifo ldquoAn experimental survey on big data frameworksrdquoFuture Generation Computer Systems 2018

[16] H-K Kwon and K-K Seo ldquoA fuzzy AHP based multi-criteriadecision-making model to select a cloud servicerdquo InternationalJournal of Smart Home vol 8 no 3 pp 175ndash180 2014

16 Scientific Programming

[17] A Rezgui and S Rezgui ldquoA stochastic approach for virtualmachine placement in volunteer cloud federationsrdquo in Proceed-ings of the 2nd IEEE International Conference on Cloud Engi-neering IC2E 2014 pp 277ndash282 BostonMAUSAMarch 2014

[18] M Salama A Zeid A Shawish and X Jiang ldquoA novel QoS-based framework for cloud computing service provider selec-tionrdquo International Journal of Cloud Applications and Comput-ing vol 4 no 2 pp 48ndash72 2014

[19] G Mateescu W Gentzsch and C J Ribbens ldquoHybrid comput-ing-where hpc meets grid and cloud computingrdquo Future Gener-ation Computer Systems vol 27 no 5 pp 440ndash453 2011

[20] I Foster C Kesselman and S Tuecke ldquoThe anatomy of the gridenabling scalable virtual organizationsrdquo International Journalof High Performance Computing Applications vol 15 no 3 pp200ndash222 2001

[21] S Benedict ldquoPerformance issues and performance analysistools for HPC cloud applications a surveyrdquo Computing vol 95no 2 pp 89ndash108 2013

[22] E C Inacio and M A R Dantas ldquoA survey into performanceand energy efficiency in HPC cloud and big data environ-mentsrdquo International Journal of Networking and Virtual Organ-isations vol 14 no 4 pp 299ndash318 2014

[23] N Yuhanna andM Gualtieri ldquoElasticity Automation and Pay-As-You-Go Compel Enterprise Adoption Of Hadoop in theCloud Q2 2016rdquo

[24] S Ullah M D Awan and M S Khiyal ldquoA price-performanceanalysis of EC2 google compute and rackspace cloud providersfor scientific computingrdquo Journal of Mathematics and ComputerScience vol 16 no 02 pp 178ndash192 2016

[25] J Hurwitz A Nugent F Halper and M Kaufman Big Data forDummies John Wiley Sons 2013

[26] S K Garg S Versteeg and R Buyya ldquoA framework for rankingof cloud computing servicesrdquo Future Generation Computer Sys-tems vol 29 no 4 pp 1012ndash1023 2013

[27] D Wu S Sakr and L Zhu ldquoBig data programming modelsrdquo inHandbook of Big Data Technologies pp 31ndash63 2017

[28] S Garcıa S Ramırez-Gallego J Luengo J M Benıtez and FHerrera ldquoBig data preprocessing methods and prospectsrdquo BigData Analytics vol 1 no 1 2016

[29] A Li X Yang S Kandula andM Zhang ldquoCloudCmp compar-ing public cloud providersrdquo in Proceedings of the 10th ACMSIGCOMM Conference on Internet Measurement (IMC rsquo10) pp1ndash14 ACM Melbourne Australia November 2010

[30] J Veiga R R Exposito X C Pardo G L Taboada and JTourifio ldquoPerformance evaluation of big data frameworks forlarge-scale data analyticsrdquo in Proceedings of the 4th IEEE Inter-national Conference on Big Data Big Data 2016 pp 424ndash431December 2016

[31] I Mavridis and H Karatza ldquoLog file analysis in cloud withApacheHadoop andApache Sparkrdquo inProceedings of the SecondInternational Workshop on Sustainable Ultrascale ComputingSystems (NESUS 2015) Poland 2015

[32] M Zaharia M Chowdhury and T Das ldquoFast and interactiveanalytics over Hadoop data with Sparkrdquo USENIX Login vol 37no 4 pp 45ndash51 2012

[33] S Vellaipandiyan and P V Raja ldquoPerformance evaluation ofdistributed framework over YARN clustermanagerrdquo inProceed-ings of the 2016 IEEE International Conference on ComputationalIntelligence and Computing Research ICCIC 2016 December2016

[34] V Taran O Alienin S Stirenko Y Gordienko and A RojbildquoPerformance evaluation of distributed computing environ-ments with Hadoop and Spark frameworksrdquo in Proceedingsof the 2017 IEEE International Young Scientistsrsquo Forum onApplied Physics and Engineering (YSF) pp 80ndash83 Lviv UkraineOctober 2017

[35] S Gopalani and R Arora ldquoComparing Apache Spark and MapReduce with performance analysis using K-meansrdquo Interna-tional Journal of Computer Applications vol 113 no 1 pp 8ndash112015

[36] M Bertoni S Ceri A Kaitoua and P Pinoli ldquoEvaluating cloudframeworks on genomic applicationsrdquo in Proceedings of the 3rdIEEE International Conference on Big Data IEEE Big Data 2015pp 193ndash202 November 2015

[37] S Hukerikar R A Ashraf and C Engelmann ldquoTowards newmetrics for high-performance computing resiliencerdquo in Pro-ceedings of the 7th Fault Tolerance for HPC at eXtreme ScaleWorkshop FTXS 2017 pp 23ndash30 June 2017

[38] R Lu G Wu B Xie and J Hu ldquoStream bench towards bench-markingmoderndistributed streamcomputing frameworksrdquo inProceedings of the 7th IEEEACM International Conference onUtility and Cloud Computing UCC 2014 pp 69ndash78 December2014

[39] L Gu and H Li ldquoMemory or time performance evaluation foriterative operation on Hadoop and Sparkrdquo in Proceedings of the15th IEEE International Conference on High Performance Com-puting and Communications HPCC 2013 and 11th IEEEIFIPInternational Conference on Embedded andUbiquitous Comput-ing EUC 2013 pp 721ndash727 November 2013

[40] M A Lopez A G P Lobato and O C M B Duarte ldquoA per-formance comparison of open-source stream processing plat-formsrdquo in Proceedings of the 59th IEEE Global CommunicationsConference GLOBECOM 2016 December 2016

[41] J L Gustafson ldquoReevaluating Amdahlrsquos lawrdquo Communicationsof the ACM vol 31 no 5 pp 532-533 1988

[42] D Garcıa-Gil S Ramırez-Gallego S Garcıa and F HerreraldquoA comparison on scalability for batch big data processing onApache Spark and Apache Flinkrdquo Big Data Analytics vol 2 no1 2017

[43] P Jakovits and S N Srirama ldquoEvaluating Mapreduce frame-works for iterative scientific computing applicationsrdquo in Pro-ceedings of the 2014 International Conference on High Perfor-mance Computing and Simulation HPCS 2014 pp 226ndash233 July2014

[44] C Boden A Spina T Rabl and V Markl ldquoBenchmarking dataflow systems for scalable machine learningrdquo in Proceedings ofthe 4th ACM SIGMODWorkshop on Algorithms and Systems forMapReduce and Beyond BeyondMR 2017 May 2017

[45] N Spangenberg M Roth and B Franczyk Evaluating NewApproaches of Big Data Analytics Frameworks vol 208 ofLecture Notes in Business Information Processing PublisherName Springer Cham 2015

[46] J Shi Y Qiu U FMinhas et al ldquoClash of the titansMapReducevs Spark for large scale data analyticsrdquo in Proceedings of the 41stInternational Conference on Very Large Data Bases pp 2110ndash2121 Kohala Coast HI USA 2015

[47] M Kang and J Lee ldquoA comparative analysis of iterativeMapRe-duce systemsrdquo in Proceedings of the the Sixth InternationalConference pp 61ndash64 Jeju Repbulic of Korea October 2016

[48] H Lee M Kang S-B Youn J-G Lee and Y Kwon ldquoAnexperimental comparison of iterativeMapReduce frameworksrdquo

Scientific Programming 17

in Proceedings of the 25th ACM International Conference onInformation and Knowledge Management CIKM 2016 pp2089ndash2094 October 2016

[49] S. Chintapalli, D. Dagit, B. Evans et al., "Benchmarking streaming computation engines: Storm, Flink and Spark streaming," in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016, pp. 1789–1792, May 2016.

[50] X. Zhang, C. Liu, S. Nepal, W. Dou, and J. Chen, "Privacy-preserving layer over MapReduce on cloud," in Proceedings of the 2nd International Conference on Cloud and Green Computing, CGC 2012, held jointly with the 2nd International Conference on Social Computing and Its Applications, SCA 2012, pp. 304–310, China, November 2012.

[51] L. Chung, B. A. Nixon, and E. Yu, "Using nonfunctional requirements to systematically select among alternatives in architectural design," in Proceedings of the 1st International Workshop on Architectures for Software Systems, pp. 31–43, 1995.

[52] S. Qian, G. Wu, J. Huang, and T. Das, "Benchmarking modern distributed streaming platforms," in Proceedings of the 2016 IEEE International Conference on Industrial Technology (ICIT), pp. 592–598, Taipei, Taiwan, March 2016.

[53] F. Dambreville and A. Toumi, "Generic and massively concurrent computation of belief combination rules," in Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, pp. 1–6, Blagoevgrad, Bulgaria, November 2016.

[54] R. W. Techentin, M. W. Markland, R. J. Poole, D. R. Holmes III, C. R. Haider, and B. K. Gilbert, "Page rank performance evaluation of cluster computing frameworks on Cray Urika-GX supercomputer," Computer Science and Engineering, vol. 6, pp. 33–38, 2016.

[55] O.-C. Marcu, A. Costan, G. Antoniu, and M. S. Pérez-Hernández, "Spark versus Flink: understanding performance in big data analytics frameworks," in Proceedings of the 2016 IEEE International Conference on Cluster Computing, CLUSTER 2016, pp. 433–442, Taipei, Taiwan, September 2016.



… select a suitable big data framework. The selection of one big data platform over the others comes down to the specific application requirements and constraints, which may involve several tradeoffs and application usage scenarios. However, we can identify some key factors that need to be fulfilled before deploying a big data application in the cloud. In this section, based on empirical evidence from the available literature, we discuss the advantages and disadvantages of each resource management framework.

3.1. Processing Speed. Processing speed is an important performance measurement that may be used to evaluate the effectiveness of different resource management frameworks. It is a common metric for the maximum number of I/O operations to disk or memory, or for the data transfer rate between the computational units of the cluster, over a specific amount of time. In the context of big data, the average processing speed, represented as m and calculated over n iteration runs, is the maximum number of memory- or disk-intensive operations that can be performed over the time intervals t_i:

m = \frac{\sum_{i=1}^{n} m_i}{\sum_{i=1}^{n} t_i}   (1)
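Equation (1) is straightforward to evaluate in code. The sketch below uses illustrative operation counts and timings, not values from any of the cited studies:

```python
def avg_processing_speed(ops, times):
    """Average processing speed m, equation (1): total memory/disk-intensive
    operations divided by total elapsed time over n iteration runs."""
    if len(ops) != len(times) or not ops:
        raise ValueError("need equal-length, non-empty samples")
    return sum(ops) / sum(times)

# Three iterations: 1200, 800, and 1000 operations in 2, 1, and 2 seconds.
print(avg_processing_speed([1200, 800, 1000], [2.0, 1.0, 2.0]))  # 600.0
```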

Veiga et al. [30] conducted a series of experiments on a multicore cluster setup to compare the performance of Apache Hadoop, Spark, and Flink. Apache Spark and Flink proved to be much more efficient execution platforms than Hadoop for nonsort benchmarks. It was further noted that Spark showed better performance for operations such as WordCount and K-Means (CPU-bound in nature), while Flink achieved better results for the PageRank algorithm (memory-bound in nature). Mavridis and Karatza [31] experimentally compared performance statistics of Apache Hadoop and Spark on the Okeanos IaaS cloud platform. For each set of experiments, the relevant statistics for execution time, working nodes, and dataset size were recorded. Spark's performance was found to be better than Hadoop's in most cases. Furthermore, Spark on the YARN platform showed suboptimal results compared to execution in standalone mode; similar results were observed by Zaharia et al. [32] on a 100 GB dataset. Vellaipandiyan and Raja [33] demonstrated a performance evaluation and comparison of the Hadoop and Spark frameworks on residents' record datasets ranging from 100 GB to 900 GB in size. Spark's performance was relatively better when the dataset size was small to medium (100 GB–750 GB); beyond that, its performance declined compared to Hadoop. The primary reason for the decline was that the Spark cache could not fit the larger datasets into memory. Taran et al. [34] quantified the performance differences of Hadoop and Spark using WordCount datasets ranging from 100 KB to 1 GB. It was observed that the Hadoop framework was five times faster than Spark when the evaluation was performed on larger data sources, whereas for smaller tasks Spark showed better performance. However, the speed-up ratio decreased for both frameworks as the input dataset grew.

Gopalani and Arora [35] used the K-Means algorithm on small, medium, and large location datasets to compare the Hadoop and Spark frameworks. The study results showed that Spark performed up to three times better than MapReduce in most cases. Bertoni et al. [36] performed an experimental evaluation of Apache Flink and Storm using a large genomic dataset on the Amazon EC2 cloud. Apache Flink was superior to Storm for histogram and map operations, while Storm outperformed Flink when a genomic join application was deployed.

3.2. Fault Tolerance. Fault tolerance is the characteristic that enables a system to continue functioning in case of the failure of one or more components. High-performance computing applications involve hundreds of interconnected nodes performing a specific task; the failure of a node should have zero or minimal effect on the overall computation. The tolerance of a system, represented as Tol_FT, to meet its requirements after a disruption is the ratio of the time to complete tasks without observing any fault events to the overall execution time in which fault events were detected and the system state was reverted to a consistent state:

\mathrm{Tol}_{\mathrm{FT}} = \frac{T_x}{T_x + \sigma^2}   (2)

where T_x is the estimated correct execution time, obtained from a program run that is presumed to be fault-free or by averaging the execution time of several application runs that produce a known correct output, and σ² represents the variance in the program's execution time due to the occurrence of fault events. For an HPC application that consists of a set of computationally intensive tasks Γ = {τ_1, τ_2, …, τ_n}, with Tol_FT computed for each individual task τ_i as Tol_{τ_i}, the overall application resilience Tol may be calculated as [37]

\mathrm{Tol} = \sqrt[n]{\mathrm{Tol}_{\tau_1} \cdot \mathrm{Tol}_{\tau_2} \cdots \mathrm{Tol}_{\tau_n}}   (3)
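Equations (2) and (3) can be combined into a small sketch. The per-task timings and variances below are illustrative assumptions, not measurements from the cited studies:

```python
import math

def task_tolerance(t_fault_free, variance):
    """Per-task tolerance Tol_FT = T_x / (T_x + sigma^2), equation (2)."""
    return t_fault_free / (t_fault_free + variance)

def app_resilience(tolerances):
    """Overall resilience, equation (3): geometric mean of per-task tolerances."""
    return math.prod(tolerances) ** (1.0 / len(tolerances))

# Two tasks with T_x = 100, sigma^2 = 25 and T_x = 200, sigma^2 = 50;
# each task tolerance is 0.8, so the geometric mean is also 0.8.
tols = [task_tolerance(100.0, 25.0), task_tolerance(200.0, 50.0)]
print(round(app_resilience(tols), 6))  # 0.8
```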

Lu et al. [38] used the StreamBench toolkit to evaluate the performance and fault tolerance of Apache Storm, Spark, and Samza. It was found that with increased data size, Spark is much more stable and fault-tolerant than Storm, but may be less efficient than Samza. Furthermore, in terms of handling large volumes of data, both Samza and Spark outperformed Storm. Gu and Li [39] used the PageRank algorithm to perform a comparative experiment on the Hadoop and Spark frameworks. It was observed that for smaller datasets, such as wiki-Vote and soc-Slashdot0902, Spark outperformed Hadoop by a wide margin. However, this speed-up degraded as the dataset grew, and for large datasets Hadoop easily outperformed Spark. Moreover, for massively large datasets Spark was reported to crash with a JVM heap exception, while Hadoop still completed its task. Lopez et al. [40] evaluated the throughput and fault tolerance mechanisms of Apache Storm and Flink. The experiments were based on a threat detection system, in which Apache Storm demonstrated better throughput than Flink. For fault tolerance, different virtual machines were manually turned off to analyze the impact of node failures. Apache Flink used its internal subsystem to detect and migrate the failed tasks to other machines, resulting in very few message losses. Storm, on the other hand, took more time because Zookeeper, which involves some performance overhead, was responsible for reporting the state of Nimbus and thereafter reprocessing the failed task on other nodes.

3.3. Scalability. Scalability refers to the ability to accommodate large loads or changes in size/workload by provisioning resources at runtime. This can be further categorized as scaling up (making individual hardware stronger) or scaling out (adding additional nodes). One of the critical requirements of enterprises is to process large volumes of data in a timely manner to address high-value business problems. Dynamic resource scalability allows business entities to perform massive computation in parallel, thus reducing overall time complexity and effort. The definition of scalability derives from Amdahl's and Gustafson's laws [41]. Let W be the size of the workload before the improvement of the system resources, let α be the fraction of the workload that cannot benefit from improved resources (the serial part), and let 1 − α be the fraction that can. When using an n-processor system, the user workload is scaled to

W' = \alpha W + (1 - \alpha) n W   (4)

The parallel execution time of a scaled workload on n processors defines the scaled-workload speed-up S, as shown in

S = \frac{W'}{W} = \frac{\alpha W + (1 - \alpha) n W}{W} = \alpha + (1 - \alpha) n   (5)
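Dividing through by W in equation (5) gives the closed form S = α + (1 − α)n, which the hypothetical sketch below evaluates (the 5% serial fraction and 64-processor count are illustrative):

```python
def scaled_speedup(alpha, n):
    """Gustafson's scaled speed-up S = alpha + (1 - alpha) * n, equation (5),
    where alpha is the serial (non-scaling) fraction of the workload."""
    return alpha + (1.0 - alpha) * n

# A 5% serial fraction on 64 processors yields a speed-up of about 60.85.
print(scaled_speedup(0.05, 64))
```

Note how, unlike Amdahl's fixed-size law, the speed-up here grows almost linearly with n because the workload is assumed to scale with the resources.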

García-Gil et al. [42] performed a scalability and performance comparison of Apache Spark and Flink using a feature selection framework that assembles multiple information-theoretic criteria into a single greedy algorithm. The ECBDL14 dataset was used to measure the scalability factor of the frameworks; it was observed that Spark's scalability performance was 4–10 times better than Flink's. Jakovits and Srirama [43] analyzed four MapReduce-based frameworks, including Hadoop and Spark, benchmarking the Partitioning Around Medoids, Clustering Large Applications, and Conjugate Gradient linear system solver algorithms against MPI. All experiments were performed on a cloud of 33 Amazon EC2 large instances. For all algorithms, Spark performed much better than Hadoop in terms of both performance and scalability. Boden et al. [44] investigated scalability with respect to both data size and dimensionality in order to provide a fair and insightful benchmark that reflects the requirements of real-world machine learning applications. The benchmark comprised distributed optimization algorithms for supervised learning as well as algorithms for unsupervised learning. For supervised learning, they implemented machine learning algorithms using the Breeze library, while for unsupervised learning they chose the popular k-Means clustering algorithm in order to assess the scalability of the two resource management frameworks. The overall execution time of Flink was relatively low in resource-constrained settings with a limited number of nodes, while Spark had a clear edge once enough main memory became available through the addition of new computing nodes.

3.4. Machine Learning and Iterative Tasks Support. Big data applications are inherently complex and usually involve tasks and algorithms that are iterative in nature. These applications have a distinctly cyclic character, continually repeating a set of tasks until the result can no longer be substantially improved.

Spangenberg et al. [45] used real-world datasets with four algorithms, namely, WordCount, K-Means, PageRank, and a relational query, to benchmark Apache Flink and Storm. It was observed that Apache Storm performs better in batch mode than Flink. However, with increasing complexity, Apache Flink gained a performance advantage over Storm, making it better suited for iterative data or graph processing. Shi et al. [46] focused on analyzing Apache MapReduce and Spark for batch and iterative jobs. For smaller datasets, Spark proved to be the better choice, but when experiments were performed on larger datasets, MapReduce turned out to be several times faster than Spark. For iterative operations such as K-Means, Spark was about 1.5 times faster than MapReduce in the first iteration and more than 5 times faster in subsequent iterations.

Kang and Lee [47] examined five resource management frameworks, including Apache Hadoop and Spark, with respect to the performance overheads (disk input/output, network communication, scheduling, etc.) involved in supporting iterative computation, using the PageRank algorithm to evaluate these issues. Since static data processing tends to be a more frequent operation than dynamic data processing, being repeated in every iteration of MapReduce, it may cause significant performance overhead in the case of MapReduce. Apache Spark, on the other hand, uses read-only cached versions of objects (resilient distributed datasets) that can be reused in parallel operations, thereby reducing the performance overhead during iterative computation. Lee et al. [48] evaluated five systems, including Hadoop and Spark, over various workloads using four iterative algorithms; the experimentation was performed on the Amazon EC2 cloud. Overall, Spark showed the best performance when iterative operations were performed in main memory. In contrast, the performance of Hadoop was significantly poorer than that of the other resource management systems.

3.5. Latency. Big data and low latency are strongly linked. Big data applications provide true value to businesses, but they are mostly time-critical. If cloud computing is to be a successful platform for big data implementation, one of the key requirements will be the provisioning of high-speed networks to reduce communication latency. Furthermore, big data frameworks usually involve a centralized design in which the scheduler assigns all tasks through a single node, which may significantly affect latency when the size of the data is huge.


Let T_elapsed be the elapsed time between the start and finish times of a program in a distributed architecture, let T_i be the effective execution time, and let λ_i be the total idle units of the ith processor from a set of N processors. Then the average latency λ(W, N) for a workload of size W is defined as the average amount of overhead time needed for each processor to complete the task:

\lambda(W, N) = \frac{\sum_{i=1}^{N} (T_{\mathrm{elapsed}} - T_i + \lambda_i)}{N}   (6)
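Equation (6) amounts to averaging each processor's overhead. The following sketch uses made-up timings for a two-processor run:

```python
def average_latency(t_elapsed, exec_times, idle_units):
    """Average per-processor overhead lambda(W, N), equation (6):
    the mean of (T_elapsed - T_i + lambda_i) over the N processors."""
    n = len(exec_times)
    return sum(t_elapsed - t + idle for t, idle in zip(exec_times, idle_units)) / n

# The job finishes at t = 10; the two processors compute for 9 and 7
# time units and idle for 0 and 1 units, respectively (illustrative values).
print(average_latency(10.0, [9.0, 7.0], [0.0, 1.0]))  # 2.5
```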

Chintapalli et al. [49] conducted a detailed analysis of the Apache Storm, Flink, and Spark streaming engines for latency and throughput. The results indicated that at high throughput, Flink and Storm have significantly lower latency than Spark; Spark, however, was able to handle higher throughput than the other streaming engines. Lu et al. [38] proposed the StreamBench benchmark framework to evaluate modern distributed stream processing frameworks. The framework covers dataset selection, data generation methodologies, program set description, workload suite design, and metric proposition. Two real-world datasets, AOL Search Data and the CAIDA Anonymized Internet Traces Dataset, were used to assess the performance, scalability, and fault tolerance of the streaming frameworks. It was observed that Storm's latency was in most cases far lower than Spark's, except when the scales of the workload and dataset were massive.

3.6. Security. In contrast to the classical HPC environment, where information is stored in-house, many big data applications are now increasingly deployed on the cloud, where privacy-sensitive information may be accessed or recorded by different data users with ease. Although data privacy and security issues are not a new topic in the area of distributed computing, their importance is amplified by the wide adoption of cloud computing services for big data platforms. A dataset may be exposed to multiple users for different purposes, which may lead to security and privacy risks.

Let N be the number of security categories that a security mechanism may cover. For instance, a framework may use an encryption mechanism to provide data security and access control lists for authentication and authorization services. Let max(W_i) be the maximum weight assigned to the ith security category from the list of N categories, and let W_i be the reputation score of a particular resource management framework in that category. Then the framework's security ranking score can be represented as

\mathrm{Sec}_{\mathrm{score}} = \frac{\sum_{i=1}^{N} W_i}{\sum_{i=1}^{N} \max(W_i)}   (7)
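Equation (7) reduces to a ratio of sums. The category weights in the sketch below are purely illustrative, not actual scores assigned by the study:

```python
def security_score(scores, max_weights):
    """Sec_score, equation (7): achieved reputation scores divided by the
    maximum attainable weights across the N security categories."""
    return sum(scores) / sum(max_weights)

# Hypothetical framework scoring 0.8 on encryption, 0.5 on authentication,
# and 0.7 on access control, each category capped at weight 1.0.
print(round(security_score([0.8, 0.5, 0.7], [1.0, 1.0, 1.0]), 3))  # 0.667
```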

Hadoop and Storm use the Kerberos authentication protocol for computing nodes to prove their identity [50]. Spark adopts a password-based shared secret configuration as well as Access Control Lists (ACLs) to control its authentication and authorization mechanisms. In Flink, stream brokers are responsible for providing authentication across multiple services. Apache Samza/Kafka provides no built-in security at the system level.

3.7. Dataset Size Support. Many scientific applications scale up to hundreds of nodes to process massive amounts of data that may exceed hundreds of terabytes. Unless big data applications are properly optimized for larger datasets, performance may degrade as the data grows; in many cases, this may even result in a software crash, with a corresponding loss of time and money. We used the same methodology to collect big data support statistics as presented in the earlier sections.

4. Discussion on Big Data Frameworks

Every big data framework has evolved with its own unique design characteristics and application-specific requirements. Based on the key factors and their empirical evidence used to evaluate big data frameworks, as discussed in Section 3, we summarize the results of the seven evaluation factors in the form of a star ranking, Ranking_RF ∈ [0, s], where s represents the maximum evaluation score. The general equation to produce the ranking for each resource framework (RF) is given as

\mathrm{Ranking}_{\mathrm{RF}} = \left\| \frac{s}{\sum_{i=1}^{7} \sum_{j=1}^{N} \max(W_{ij})} \cdot \sum_{i=1}^{7} \sum_{j=1}^{N} W_{ij} \right\|   (8)

where N is the total number of research studies for a particular set of evaluation metrics, max(W_ij) ∈ [0, 1] is the relative normalized weight assigned to each literature study based on the number of experiments performed, and W_ij is the framework test-bed score calculated from the experimentation results of each study.
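Equation (8) can be sketched as follows. The factor/study score matrices are invented for illustration, and a maximum star rating of s = 5 is assumed (the paper does not state the value of s here):

```python
def framework_ranking(weights, max_weights, s=5):
    """Ranking_RF, equation (8): test-bed scores W_ij summed over the
    evaluation factors i and studies j, normalized by the maximum
    attainable weights, and rescaled to a star rating in [0, s]."""
    total = sum(sum(row) for row in weights)
    max_total = sum(sum(row) for row in max_weights)
    return s * total / max_total

# Two factors x two studies, with purely illustrative scores; the sum of
# scores is 3.0 against a maximum of 4.0, giving roughly 3.75 stars.
w = [[0.9, 0.7], [0.6, 0.8]]
w_max = [[1.0, 1.0], [1.0, 1.0]]
print(round(framework_ranking(w, w_max), 2))  # 3.75
```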

Hadoop MapReduce has a clear edge in large-scale deployment and larger dataset processing. Hadoop is highly compatible and interoperable with other frameworks, and it offers a reliable fault tolerance mechanism that can provide failure-free operation over a long period of time. Moreover, Hadoop can operate on low-cost configurations. However, it is not suitable for real-time applications, and it has a significant disadvantage when latency, throughput, and iterative job support for machine learning are the key considerations of the application requirements.

Apache Spark is designed to be a replacement for the batch-oriented Hadoop ecosystem that runs over both static and real-time datasets. It is highly suitable for high-throughput streaming applications where latency is not a major issue. Spark is memory intensive, and all operations take place in memory; as a result, it may crash if not enough memory is available for further operations (before the release of Spark version 1.5, it was not capable of handling datasets larger than the size of RAM, and the problem of handling larger datasets still persists in the newer releases, with different performance overheads). A few research efforts, such as Project Tungsten, aim to improve the memory and CPU efficiency of Spark applications. Spark also lacks its own storage system, so its integration with HDFS through YARN, or with Cassandra using Mesos, is an extra overhead for cluster configuration.

Apache Flink is a true streaming engine. Flink supports both batch and real-time operations over a common runtime to fulfill the requirements of the Lambda architecture; it may also work in pure batch mode by stopping the streaming source. Like Spark, Flink performs all operations in memory, but in case of a memory hog it may also use disk storage to avoid application failure. Flink has some major advantages over Hadoop and Spark, providing better support for iterative processing with high throughput and low latency.

Apache Storm was designed to provide a scalable, fault-tolerant, real-time streaming engine for data analysis, much as Hadoop did for batch processing. However, the empirical evidence suggests that Apache Storm is inefficient at meeting the scale-up/scale-down requirements of real-time big data applications. Furthermore, since it uses microbatch stream processing, it is not very efficient where continuous stream processing is a major concern, nor does it provide a mechanism for simple batch processing. For fault tolerance, Storm uses Zookeeper to store the state of the processes, which may involve some extra overhead and may also result in message loss. On the other hand, Storm is an ideal solution for near-real-time application processing, where the workload must be processed with minimal delay under strict latency requirements.

Apache Samza, in integration with Kafka, provides some unique features not offered by other stream processing engines. Samza provides a powerful checkpointing-based fault tolerance mechanism with minimal data loss, and Samza jobs can achieve high throughput with low latency when integrated with Kafka. However, Samza lacks some important features as a data processing engine; furthermore, it offers no built-in security mechanism for data access control.

To categorize the selection of the best resource engine based on a particular set of requirements, we use the framework proposed by Chung et al. [51]. The framework provides a layout for matching, ranking, and selecting a system based on particular requirements. The matching criterion is based on user goals, which are categorized as soft and hard goals. Soft goals represent the nonfunctional requirements of the system (such as security, fault tolerance, and scalability), while hard goals represent its functional aspects (such as machine learning and data size support). The relationship between the system components and the goals can be ranked as very positive (++), positive (+), negative (−), or very negative (−−). Based on the evidence provided in the literature, the categorization of the major resource management frameworks is presented in Figure 7.
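The matching step of this framework can be sketched as a weighted scoring of goal ratings. The ratings and requirement weights below are a hypothetical excerpt, not a reproduction of the full Figure 7:

```python
# Map the qualitative ratings of Chung et al.'s framework to numbers.
RATING = {"++": 2, "+": 1, "-": -1, "--": -2}

def match_score(ratings, requirement_weights):
    """Weighted sum of a framework's goal ratings, where the weights
    express how much the user cares about each goal."""
    return sum(RATING[ratings[goal]] * w for goal, w in requirement_weights.items())

# Hypothetical ratings for two engines over three goals.
frameworks = {
    "Spark": {"latency": "+", "fault_tolerance": "++", "machine_learning": "++"},
    "Storm": {"latency": "++", "fault_tolerance": "+", "machine_learning": "-"},
}
# A user who values low latency highly but also needs ML and fault tolerance.
needs = {"latency": 3, "fault_tolerance": 1, "machine_learning": 1}

best = max(frameworks, key=lambda f: match_score(frameworks[f], needs))
print(best)  # Spark
```

Varying the weights in `needs` shifts the selection, which mirrors how the soft-goal/hard-goal layout is meant to be applied to concrete application requirements.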

5. Related Work

Our research differs from other efforts because the subject, goal, and object of study are not identical: we provide an in-depth comparison of popular resource engines based on empirical evidence from the existing literature. Hesse and Lorenz [8] conducted a conceptual survey of stream processing systems; however, their discussion focused on basic differences between real-time data processing engines. Singh and Reddy [9] provided a thorough analysis of big data analytics platforms that included peer-to-peer networks, field-programmable gate arrays (FPGAs), the Apache Hadoop ecosystem, high-performance computing (HPC) clusters, multicore CPUs, and graphics processing units (GPUs). Our case is different, as we are particularly interested in big data processing engines. Finally, Landset et al. [10] focused on machine learning libraries and their evaluation based on ease of use, scalability, and extensibility; our work differs from theirs as well, since the primary focuses of the two studies are not identical.

Chen and Zhang [11] discussed big data problems and challenges and the associated techniques and technologies for addressing them. Several potential techniques, including cloud computing, quantum computing, granular computing, and biological computing, were investigated, and possible opportunities to explore these domains were demonstrated. However, the performance evaluation was discussed only on theoretical grounds. A taxonomy and detailed analysis of the state of the art in big data 2.0 processing systems were presented in [12]; the focus of that study was to identify current research challenges and highlight opportunities for new innovations and optimizations in future research and development. Assuncao et al. [13] reviewed multiple generations of data stream processing frameworks that provide mechanisms for resource elasticity to match the demands of stream processing services. The study examined the challenges associated with efficient resource management decisions and suggested solutions derived from existing research; however, its metrics were restricted to the elasticity/scalability aspect of big data streaming frameworks.

As shown in Table 3, our work differs from previous studies, which focused on classifying resource management frameworks on theoretical grounds. In contrast to earlier approaches, we classify and categorize big data resource management frameworks on empirical grounds, derived from multiple evaluation/experimentation studies. Furthermore, our evaluation/ranking methodology is based on a comprehensive list of study variables that were not addressed in the earlier studies.

6. Conclusions and Future Work

There are a number of disruptive and transformative big data technologies and solutions that are rapidly emerging and evolving in order to provide data-driven insight and innovation. The primary objective of this study was to classify popular big data resource management systems. The study also aimed to address the selection of a candidate resource provider based on specific big data application requirements. We surveyed different big data resource management frameworks and investigated the advantages and disadvantages of each. We carried out a performance evaluation of the resource management engines based on seven key factors, and each framework was ranked based on the empirical evidence from the literature.

6.1. Observations and Findings. Some key findings of the study are as follows:

(i) In terms of processing speed, Apache Flink outperforms the other resource management frameworks for small, medium, and large datasets [30, 36]. However, during our own set of experiments on an Amazon EC2 cluster with varied task manager settings (1–4 task managers per node), Flink failed to complete custom smaller-size JVM dataset jobs due to the inefficient memory management of the Flink memory manager. We could not find any reported evidence of this particular use case in the relevant literature. It seems that most of the performance evaluation studies employed standard benchmarking test sets in which the dataset size was relatively large, so this particular use case was not reported. Further research effort is required to elucidate the underlying factors in this particular case.

Table 3: Comparison and application areas of related research studies.

[8] — Data model: data stream processing systems. Resource frameworks: Storm, Flink, Spark, Samza. Study features: a brief comparison of resource frameworks. Evaluation/ranking methodology: none.

[9] — Data model: batch and stream processing systems. Resource frameworks: horizontal scaling systems such as peer-to-peer, MapReduce, MPI, and Spark, and vertical scaling systems such as CUDA and HDL. Study features: comparison of horizontal and vertical scaling systems. Evaluation/ranking methodology: theoretical comparison of resource frameworks.

[10] — Data model: batch and stream processing engines. Resource frameworks: MapReduce, Spark, Flink, and Storm, as well as machine learning libraries. Study features: machine learning libraries and their evaluation mechanisms. Evaluation/ranking methodology: performance comparison with respect to machine learning toolkits.

[11] — Data model: batch and stream processing frameworks. Resource frameworks: Hadoop, Storm, and other big data frameworks. Study features: in-depth analysis of big data opportunities and challenges. Evaluation/ranking methodology: none.

[12] — Data model: batch and stream processing frameworks. Resource frameworks: Hadoop, Spark, Storm, Flink, and Tez, as well as SQL, graph, and bulk synchronous parallel models. Study features: analysis of current open research challenges in the field of big data and promising directions for future research. Evaluation/ranking methodology: none.

[13] — Data model: stream processing engines. Resource frameworks: Apache Storm, S4, Flink, Samza, Spark Streaming, and Twitter Heron. Study features: classification of elasticity metrics for resource allocation strategies that meet the demands of stream processing services. Evaluation/ranking methodology: evaluation of elasticity scaling metrics for stream processing systems.

Figure 7: Categorization of resource engines based on key big data application requirements. [The original figure relates Hadoop, Spark, Flink, Storm, and Samza to soft (nonfunctional) goals — fault tolerance/recoverability, scalability, performance, security (access control, authentication), and latency — and hard (functional) goals — machine learning/iterative support and data size — using ratings such as ++ and +; only the caption and labels are recoverable from the extracted text.]

(ii) Big data applications usually involve massive amounts of data. Apache Spark supports the necessary strategies for fault tolerance, but it has been reported to crash on larger datasets. Even during our experimentation, Apache Spark version 1.6 (selected for compatibility with earlier research) crashed on several occasions when the dataset was larger than 500 GB. Although Spark has been ranked higher in terms of fault tolerance with increasing data scale in several studies [38, 52], it has limitations in handling the larger datasets typical of big data applications, and hence such studies cannot be generalized.

(iii) Spark MLlib and Flink-ML offer a variety of machine learning algorithms and utilities to exploit distributed and scalable big data applications. Spark MLlib outperforms Flink-ML in most machine learning use cases [42], except when repeated passes are performed on unchanged data. However, this performance evaluation may be investigated further, as research studies such as [53] reported differently, with Flink outperforming Spark on sufficiently large cases.

(iv) For graph processing algorithms such as PageRank, Flink uses the Gelly library, which provides native closed-loop iteration operators, making it a suitable platform for large-scale graph analytics. Spark, on the other hand, uses the GraphX library, which has a much longer preprocessing time to build the graph and other data structures, making its performance worse compared to Apache Flink. Apache Flink has been reported to obtain the best results for graph processing datasets (3x–5x in [54] and 2x–3x in [45]) as compared to Spark. However, some studies, such as [55], reported Spark to be 1.7x faster than Flink for large graph processing. Such inconsistent behavior may be further investigated in future research studies.

6.2. Future Work. In the area of big data, there is still a clear gap that requires more effort from the research community to build an in-depth understanding of the performance characteristics of big data resource management frameworks. We consider this study a step towards enlarging our knowledge of the big data world and an effort towards improving the state of the art and achieving the big vision of the big data domain. In earlier studies, a clear ranking could not be established, as the study parameters were mostly limited to a few issues such as throughput, latency, and machine learning. Furthermore, further investigation is required on resource engines such as Apache Samza in comparison with other frameworks. In addition, research effort needs to be carried out in several areas, such as data organization, platform-specific tools, and technological issues in the big data domain, in order to create next-generation big data infrastructures. The performance evaluation factors might also vary among systems depending on the algorithms used. As future work, we plan to benchmark these popular resource engines for meeting the resource demands and requirements of different scientific applications. Moreover, a scalability analysis could be done; in particular, the performance evaluation of adding dynamic worker nodes and the resulting performance analysis is of special interest. Additionally, further research can be carried out to evaluate performance with respect to resource competition between jobs (on different resource schedulers such as YARN and Mesos) and the fluctuation of available computing resources. Finally, most of the experiments in earlier studies were performed using standard parameter configurations; however, each resource management framework offers domain-specific tweaks and configuration optimization mechanisms for meeting application-specific requirements. The development of a benchmark suite that aims to find maximum throughput based on configuration optimization would be an interesting direction for future research.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] "Big Data in the Cloud: Converging Technologies (August 11)," https://www.intel.com/content/dam/www/public/emea/de/de/documents/product-briefs/big-data-cloud-technologies-brief.pdf.

[2] Q. He, J. Yan, R. Kowalczyk, H. Jin, and Y. Yang, "Lifetime service level agreement management with autonomous agents for services provision," Information Sciences, vol. 179, no. 15, pp. 2591–2605, 2009.

[3] O. Rana, M. Warnier, T. B. Quillinan, and F. Brazier, "Monitoring and reputation mechanisms for service level agreements," in 5th International Workshop on Grid Economics and Business Models, GECON '08, vol. 5206 of Lecture Notes in Computer Science, pp. 125–139, 2008.

[4] N. Yuhanna and M. Gualtieri, "The Forrester Wave: Big Data Hadoop Cloud Solutions, Q2 2016: Elasticity, Automation, and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud (June 20)," https://ncmedia.azureedge.net/ncmedia/2016/05/The_Forrester_Wave_Big_D.pdf.

[5] R. Arora, "An introduction to big data, high performance computing, high-throughput computing, and Hadoop," in Conquering Big Data with High Performance Computing, pp. 1–12, 2016.

[6] F. Pop, J. Kołodziej, and B. D. Martino, Resource Management for Big Data Platforms: Algorithms, Modelling, and High-Performance Computing Techniques, Springer, 2016.

[7] M. Gribaudo, M. Iacono, and F. Palmieri, "Performance modeling of big data-oriented architectures," in Resource Management for Big Data Platforms, Computer Communications and Networks, pp. 3–34, Springer International Publishing, Cham, 2016.

[8] G. Hesse and M. Lorenz, "Conceptual survey on data stream processing systems," in Proceedings of the 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), pp. 797–802, Melbourne, Australia, December 2015.

[9] D. Singh and C. K. Reddy, "A survey on platforms for big data analytics," Journal of Big Data, vol. 2, Article ID 8, 2014.

[10] S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin, "A survey of open source tools for machine learning with big data in the Hadoop ecosystem," Journal of Big Data, vol. 2, no. 1, Article ID 24, 2015.

[11] C. L. P. Chen and C. Y. Zhang, "Data-intensive applications, challenges, techniques and technologies: a survey on Big Data," Information Sciences, vol. 275, pp. 314–347, 2014.

[12] F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, and S. Sakr, "Big data 2.0 processing systems: taxonomy and open challenges," Journal of Grid Computing, vol. 14, no. 3, pp. 379–405, 2016.

[13] M. D. d. Assuncao, A. d. S. Veith, and R. Buyya, "Distributed data stream processing and edge computing: a survey on resource elasticity and future directions," Journal of Network and Computer Applications, vol. 103, pp. 1–17, 2018.

[14] Big Data Working Group, "Big data taxonomy," 2014.

[15] W. Inoubli, S. Aridhi, H. Mezni, M. Maddouri, and E. M. Nguifo, "An experimental survey on big data frameworks," Future Generation Computer Systems, 2018.

[16] H.-K. Kwon and K.-K. Seo, "A fuzzy AHP based multi-criteria decision-making model to select a cloud service," International Journal of Smart Home, vol. 8, no. 3, pp. 175–180, 2014.


[17] A. Rezgui and S. Rezgui, "A stochastic approach for virtual machine placement in volunteer cloud federations," in Proceedings of the 2nd IEEE International Conference on Cloud Engineering, IC2E 2014, pp. 277–282, Boston, MA, USA, March 2014.

[18] M. Salama, A. Zeid, A. Shawish, and X. Jiang, "A novel QoS-based framework for cloud computing service provider selection," International Journal of Cloud Applications and Computing, vol. 4, no. 2, pp. 48–72, 2014.

[19] G. Mateescu, W. Gentzsch, and C. J. Ribbens, "Hybrid computing: where HPC meets grid and cloud computing," Future Generation Computer Systems, vol. 27, no. 5, pp. 440–453, 2011.

[20] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: enabling scalable virtual organizations," International Journal of High Performance Computing Applications, vol. 15, no. 3, pp. 200–222, 2001.

[21] S. Benedict, "Performance issues and performance analysis tools for HPC cloud applications: a survey," Computing, vol. 95, no. 2, pp. 89–108, 2013.

[22] E. C. Inacio and M. A. R. Dantas, "A survey into performance and energy efficiency in HPC, cloud and big data environments," International Journal of Networking and Virtual Organisations, vol. 14, no. 4, pp. 299–318, 2014.

[23] N. Yuhanna and M. Gualtieri, "Elasticity, Automation and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud, Q2 2016."

[24] S. Ullah, M. D. Awan, and M. S. Khiyal, "A price-performance analysis of EC2, Google Compute and Rackspace cloud providers for scientific computing," Journal of Mathematics and Computer Science, vol. 16, no. 2, pp. 178–192, 2016.

[25] J. Hurwitz, A. Nugent, F. Halper, and M. Kaufman, Big Data for Dummies, John Wiley & Sons, 2013.

[26] S. K. Garg, S. Versteeg, and R. Buyya, "A framework for ranking of cloud computing services," Future Generation Computer Systems, vol. 29, no. 4, pp. 1012–1023, 2013.

[27] D. Wu, S. Sakr, and L. Zhu, "Big data programming models," in Handbook of Big Data Technologies, pp. 31–63, 2017.

[28] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, vol. 1, no. 1, 2016.

[29] A. Li, X. Yang, S. Kandula, and M. Zhang, "CloudCmp: comparing public cloud providers," in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10), pp. 1–14, ACM, Melbourne, Australia, November 2010.

[30] J. Veiga, R. R. Exposito, X. C. Pardo, G. L. Taboada, and J. Touriño, "Performance evaluation of big data frameworks for large-scale data analytics," in Proceedings of the 4th IEEE International Conference on Big Data, Big Data 2016, pp. 424–431, December 2016.

[31] I. Mavridis and H. Karatza, "Log file analysis in cloud with Apache Hadoop and Apache Spark," in Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015), Poland, 2015.

[32] M. Zaharia, M. Chowdhury, and T. Das, "Fast and interactive analytics over Hadoop data with Spark," USENIX Login, vol. 37, no. 4, pp. 45–51, 2012.

[33] S. Vellaipandiyan and P. V. Raja, "Performance evaluation of distributed framework over YARN cluster manager," in Proceedings of the 2016 IEEE International Conference on Computational Intelligence and Computing Research, ICCIC 2016, December 2016.

[34] V. Taran, O. Alienin, S. Stirenko, Y. Gordienko, and A. Rojbi, "Performance evaluation of distributed computing environments with Hadoop and Spark frameworks," in Proceedings of the 2017 IEEE International Young Scientists' Forum on Applied Physics and Engineering (YSF), pp. 80–83, Lviv, Ukraine, October 2017.

[35] S. Gopalani and R. Arora, "Comparing Apache Spark and Map Reduce with performance analysis using K-means," International Journal of Computer Applications, vol. 113, no. 1, pp. 8–11, 2015.

[36] M. Bertoni, S. Ceri, A. Kaitoua, and P. Pinoli, "Evaluating cloud frameworks on genomic applications," in Proceedings of the 3rd IEEE International Conference on Big Data, IEEE Big Data 2015, pp. 193–202, November 2015.

[37] S. Hukerikar, R. A. Ashraf, and C. Engelmann, "Towards new metrics for high-performance computing resilience," in Proceedings of the 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017, pp. 23–30, June 2017.

[38] R. Lu, G. Wu, B. Xie, and J. Hu, "StreamBench: towards benchmarking modern distributed stream computing frameworks," in Proceedings of the 7th IEEE/ACM International Conference on Utility and Cloud Computing, UCC 2014, pp. 69–78, December 2014.

[39] L. Gu and H. Li, "Memory or time: performance evaluation for iterative operation on Hadoop and Spark," in Proceedings of the 15th IEEE International Conference on High Performance Computing and Communications, HPCC 2013, and the 11th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, EUC 2013, pp. 721–727, November 2013.

[40] M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte, "A performance comparison of open-source stream processing platforms," in Proceedings of the 59th IEEE Global Communications Conference, GLOBECOM 2016, December 2016.

[41] J. L. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, vol. 31, no. 5, pp. 532–533, 1988.

[42] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, "A comparison on scalability for batch big data processing on Apache Spark and Apache Flink," Big Data Analytics, vol. 2, no. 1, 2017.

[43] P. Jakovits and S. N. Srirama, "Evaluating MapReduce frameworks for iterative scientific computing applications," in Proceedings of the 2014 International Conference on High Performance Computing and Simulation, HPCS 2014, pp. 226–233, July 2014.

[44] C. Boden, A. Spina, T. Rabl, and V. Markl, "Benchmarking data flow systems for scalable machine learning," in Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2017, May 2017.

[45] N. Spangenberg, M. Roth, and B. Franczyk, Evaluating New Approaches of Big Data Analytics Frameworks, vol. 208 of Lecture Notes in Business Information Processing, Springer, Cham, 2015.

[46] J. Shi, Y. Qiu, U. F. Minhas et al., "Clash of the titans: MapReduce vs. Spark for large scale data analytics," in Proceedings of the 41st International Conference on Very Large Data Bases, pp. 2110–2121, Kohala Coast, HI, USA, 2015.

[47] M. Kang and J. Lee, "A comparative analysis of iterative MapReduce systems," in Proceedings of the Sixth International Conference, pp. 61–64, Jeju, Republic of Korea, October 2016.

[48] H. Lee, M. Kang, S.-B. Youn, J.-G. Lee, and Y. Kwon, "An experimental comparison of iterative MapReduce frameworks,"


in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, pp. 2089–2094, October 2016.

[49] S. Chintapalli, D. Dagit, B. Evans et al., "Benchmarking streaming computation engines: Storm, Flink and Spark streaming," in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016, pp. 1789–1792, May 2016.

[50] X. Zhang, C. Liu, S. Nepal, W. Dou, and J. Chen, "Privacy-preserving layer over MapReduce on cloud," in Proceedings of the 2nd International Conference on Cloud and Green Computing, CGC 2012, held jointly with the 2nd International Conference on Social Computing and Its Applications, SCA 2012, pp. 304–310, China, November 2012.

[51] L. Chung, B. A. Nixon, and E. Yu, "Using nonfunctional requirements to systematically select among alternatives in architectural design," in Proceedings of the 1st International Workshop on Architectures for Software Systems, pp. 31–43, 1995.

[52] S. Qian, G. Wu, J. Huang, and T. Das, "Benchmarking modern distributed streaming platforms," in Proceedings of the 2016 IEEE International Conference on Industrial Technology (ICIT), pp. 592–598, Taipei, Taiwan, March 2016.

[53] F. Dambreville and A. Toumi, "Generic and massively concurrent computation of belief combination rules," in Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, pp. 1–6, Blagoevgrad, Bulgaria, November 2016.

[54] R. W. Techentin, M. W. Markland, R. J. Poole, D. R. H. C. R. Haider, and B. K. Gilbert, "Page rank performance evaluation of cluster computing frameworks on Cray Urika-GX supercomputer," Computer Science and Engineering, vol. 6, pp. 33–38, 2016.

[55] O.-C. Marcu, A. Costan, G. Antoniu, and M. S. Perez-Hernandez, "Spark versus Flink: understanding performance in big data analytics frameworks," in Proceedings of the 2016 IEEE International Conference on Cluster Computing, CLUSTER 2016, pp. 433–442, Taipei, Taiwan, September 2016.


different virtual machines were manually turned off to analyze the impact of node failures. Apache Flink used its internal subsystem to detect and migrate the failed tasks to other machines and hence resulted in very few message losses. On the other hand, Storm took more time, as Zookeeper, which involves some performance overhead, was responsible for reporting the state of Nimbus and thereafter processing the failed task on other nodes.

3.3. Scalability. Scalability refers to the ability to accommodate large loads or changes in size/workload by provisioning resources at runtime. This can be further categorized as scale-up (vertically, by making the hardware more powerful) or scale-out (horizontally, by adding additional nodes). One of the critical requirements of enterprises is to process large volumes of data in a timely manner to address high-value business problems. Dynamic resource scalability allows business entities to perform massive computation in parallel, thus reducing overall time complexity and effort. The definition of scalability comes from Amdahl's and Gustafson's laws [41]. Let W be the size of the workload before the improvement of the system resources, let α be the fraction of the execution workload that does not benefit from the improvement of system resources (the sequential part), and let 1 − α be the fraction that does benefit. When using an n-processor system, the user workload is scaled to

W′ = αW + (1 − α)nW. (4)

The parallel execution of the scaled workload on n processors yields the scaled-workload speedup S, as shown in

S = W′/W = [αW + (1 − α)nW]/W = α + (1 − α)n. (5)
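As a quick numerical check, equations (4)-(5) can be sketched in a few lines of Python; the parameter values below are illustrative, not taken from any cited experiment:

```python
# Gustafson's scaled speedup from equations (4)-(5): with sequential
# fraction alpha, the scaled workload is W' = alpha*W + (1 - alpha)*n*W,
# so the scaled-workload speedup is S = W'/W = alpha + (1 - alpha)*n.

def scaled_speedup(alpha: float, n: int) -> float:
    """Scaled-workload speedup S for an n-processor system."""
    return alpha + (1 - alpha) * n

# Example: a workload whose sequential fraction is 25%, run on 4 processors.
print(scaled_speedup(0.25, 4))  # 0.25 + 0.75 * 4 = 3.25
```

Unlike Amdahl's fixed-size speedup, this value keeps growing nearly linearly in n, which is why Gustafson's law is the natural scalability model for big data workloads that grow with the cluster.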

García-Gil et al. [42] performed a scalability and performance comparison of Apache Spark and Flink using a feature selection framework that assembles multiple information-theoretic criteria into a single greedy algorithm. The ECBDL14 dataset was used to measure the scalability factor for the frameworks. It was observed that Spark's scalability performance was 4–10 times faster than Flink's. Jakovits and Srirama [43] analyzed four MapReduce-based frameworks, including Hadoop and Spark, benchmarking Partitioning Around Medoids, Clustering Large Applications, and a Conjugate Gradient linear system solver using MPI. All experiments were performed on a cloud of 33 Amazon EC2 large instances. For all algorithms, Spark performed much better than Hadoop in terms of both performance and scalability. Boden et al. [44] aimed to investigate scalability with respect to both data size and dimensionality in order to provide a fair and insightful benchmark that reflects the requirements of real-world machine learning applications. The benchmark comprised distributed optimization algorithms for supervised learning as well as algorithms for unsupervised learning. For supervised learning, they implemented machine learning algorithms using the Breeze library, while for unsupervised learning they chose the popular k-means clustering algorithm in order to assess the scalability of the two resource management frameworks. The overall execution time of Flink was relatively low in resource-constrained settings with a limited number of nodes, while Spark had a clear edge once enough main memory was available due to the addition of new computing nodes.

3.4. Machine Learning and Iterative Tasks Support. Big data applications are inherently complex in nature and usually involve tasks and algorithms that are iterative. These applications have a distinctly cyclic nature, continually repeating a set of tasks until the desired result is reached and the error cannot be substantially reduced further.

Spangenberg et al. [45] used real-world datasets covering four algorithms, that is, WordCount, K-Means, PageRank, and a relational query, to benchmark Apache Flink and Storm. It was observed that Apache Storm performs better in batch mode as compared to Flink. However, with increasing complexity, Apache Flink had a performance advantage over Storm, and thus it was better suited for iterative data or graph processing. Shi et al. [46] focused on analyzing Apache MapReduce and Spark for batch and iterative jobs. For smaller datasets, Spark proved to be the better choice, but when experiments were performed on larger datasets, MapReduce turned out to be several times faster than Spark. For iterative operations such as K-Means, Spark turned out to be 1.5 times faster than MapReduce in its first iteration, while Spark was more than 5 times faster in subsequent operations.

Kang and Lee [47] examined five resource management frameworks, including Apache Hadoop and Spark, with respect to performance overheads (disk input/output, network communication, scheduling, etc.) in supporting iterative computation. The PageRank algorithm was used to evaluate these performance issues. Since static data processing tends to be a more frequent operation than dynamic data processing, as static data is used in every iteration of MapReduce, it may cause significant performance overhead in the case of MapReduce. On the other hand, Apache Spark uses a read-only cached version of objects (resilient distributed datasets), which can be reused in parallel operations, thus reducing the performance overhead during iterative computation. Lee et al. [48] evaluated five systems, including Hadoop and Spark, over various workloads to compare four iterative algorithms. The experimentation was performed on the Amazon EC2 cloud. Overall, Spark showed the best performance when iterative operations were performed in main memory. In contrast, the performance of Hadoop was significantly poorer as compared to the other resource management systems.
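The caching advantage described above can be illustrated with a toy, single-machine PageRank loop (plain Python, not actual Spark code; the three-node graph is a made-up example): the static adjacency structure is loaded once and kept in memory, the way a Spark RDD marked for caching would be, while only the small rank vector changes between iterations.

```python
def load_graph():
    # Stand-in for an expensive load from distributed storage; in MapReduce
    # this static data would effectively be re-read on every iteration.
    return {"a": ["b", "c"], "b": ["c"], "c": ["a"]}

def pagerank(iterations=20, damping=0.85):
    graph = load_graph()  # loaded once and reused, like a cached RDD
    ranks = {node: 1.0 / len(graph) for node in graph}
    for _ in range(iterations):  # only the rank vector is recomputed
        contribs = {node: 0.0 for node in graph}
        for node, links in graph.items():
            share = ranks[node] / len(links)
            for dest in links:
                contribs[dest] += share
        ranks = {node: (1 - damping) / len(graph) + damping * c
                 for node, c in contribs.items()}
    return ranks

ranks = pagerank()
print(sorted(ranks, key=ranks.get, reverse=True))  # most central node first
```

The loop body touches `graph` on every pass; if `load_graph()` sat inside the loop instead, the dominant cost would be paid in every iteration, which is exactly the overhead the cited studies attribute to MapReduce-style iteration.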

3.5. Latency. Big data and low latency are strongly linked. Big data applications provide true value to businesses, but they are mostly time critical. If cloud computing is to be a successful platform for big data implementation, one of the key requirements will be the provisioning of high-speed networks to reduce communication latency. Furthermore, big data frameworks usually involve a centralized design where the scheduler assigns all tasks through a single node, which may significantly impact the latency when the size of the data is huge.


Let T_elapsed be the elapsed time between the start and finish time of a program in a distributed architecture, let T_i be the effective execution time, and let λ_i be the sum of total idle units of the i-th processor from a set of N processors. Then the average latency, represented as λ(W, N) for a workload of size W, is defined as the average amount of overhead time needed for each processor to complete the task:

λ(W, N) = (1/N) ∑_{i=1}^{N} (T_elapsed − T_i + λ_i). (6)
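A small sketch of equation (6), with made-up per-processor timings; the variable names mirror the symbols above, and none of the values come from the paper's experiments:

```python
def average_latency(t_elapsed, exec_times, idle_units):
    """Average latency lambda(W, N) = sum_i (T_elapsed - T_i + lambda_i) / N."""
    n = len(exec_times)
    return sum(t_elapsed - t_i + idle_i
               for t_i, idle_i in zip(exec_times, idle_units)) / n

# Four processors; the job finishes only when the slowest processor does.
exec_times = [9.0, 10.0, 7.5, 8.0]  # effective execution times T_i
idle_units = [0.5, 0.0, 1.0, 0.5]   # accumulated idle units lambda_i
t_elapsed = max(exec_times)         # elapsed wall-clock time: 10.0

print(average_latency(t_elapsed, exec_times, idle_units))  # 1.875
```

The metric captures how much overhead (waiting plus idling) the average processor pays on top of its useful work; a perfectly balanced, never-idle cluster scores 0.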

Chintapalli et al. [49] conducted a detailed analysis of the Apache Storm, Flink, and Spark streaming engines for latency and throughput. The study results indicated that at high throughput, Flink and Storm have significantly lower latency as compared to Spark. However, Spark was able to handle higher throughput than the other streaming engines. Lu et al. [38] proposed the StreamBench benchmark framework to evaluate modern distributed stream processing frameworks. The framework includes dataset selection, data generation methodologies, a program set description, workload suite design, and metric proposition. Two real-world datasets, AOL Search Data and the CAIDA Anonymized Internet Traces Dataset, were used to assess the performance, scalability, and fault tolerance aspects of streaming frameworks. It was observed that Storm's latency was in most cases far less than Spark's, except when the scales of the workload and dataset were massive in nature.

3.6. Security. In contrast to the classical HPC environment, where information is stored in-house, many big data applications are now increasingly deployed on the cloud, where privacy-sensitive information may be accessed or recorded by different data users with ease. Although data privacy and security issues are not a new topic in the area of distributed computing, their importance is amplified by the wide adoption of cloud computing services for big data platforms. A dataset may be exposed to multiple users for different purposes, which may lead to security and privacy risks.

Let N be a list of security categories that may be provided by a security mechanism. For instance, a framework may use an encryption mechanism to provide data security and an access control list for authentication and authorization services. Let max(W_i) be the maximum weight that can be assigned to the i-th security category from the list of N categories, and let W_i be the reputation score of a particular resource management framework. Then the framework security ranking score can be represented as

Sec_score = ∑_{i=1}^{N} W_i / ∑_{i=1}^{N} max(W_i). (7)
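Equation (7) reduces to a simple normalized sum. A hedged sketch follows; the category names and scores are hypothetical placeholders, not weights taken from the paper:

```python
def security_score(scores, max_weights):
    """Sec_score = sum(W_i) / sum(max(W_i)) over the N security categories."""
    return sum(scores.values()) / sum(max_weights.values())

# Hypothetical example: three categories, each worth at most 1.0.
max_weights = {"encryption": 1.0, "authentication": 1.0, "access_control": 1.0}
framework   = {"encryption": 0.5, "authentication": 1.0, "access_control": 0.75}

print(security_score(framework, max_weights))  # 2.25 / 3.0 = 0.75
```

A framework covering every category at full weight would score 1.0; one with no built-in security, such as Samza in the comparison below, would score 0.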

Hadoop and Storm use the Kerberos authentication protocol for computing nodes to prove their identity [50]. Spark adopts a password-based shared secret configuration as well as Access Control Lists (ACLs) to control the authentication and authorization mechanisms. In Flink, stream brokers are responsible for providing the authentication mechanism across multiple services. Apache Samza/Kafka provides no built-in security at the system level.

3.7. Dataset Size Support. Many scientific applications scale up to hundreds of nodes to process massive amounts of data that may exceed hundreds of terabytes. Unless big data applications are properly optimized for larger datasets, this may result in performance degradation as the data grows. Furthermore, in many cases this may result in a crash of the software, with a loss of time and money. We used the same methodology to collect big data support statistics as presented in the earlier sections.

4. Discussion on Big Data Frameworks

Every big data framework has evolved from its unique design characteristics and application-specific requirements. Based on the key factors and the empirical evidence used to evaluate big data frameworks, as discussed in Section 3, we summarize the results of the seven evaluation factors in the form of a star ranking, Ranking_RF ∈ [0, s], where s represents the maximum evaluation score. The general equation to produce the ranking for each resource framework (RF) is given as

Ranking_RF = ‖ (s / ∑_{i=1}^{7} ∑_{j=1}^{N} max(W_ij)) × ∑_{i=1}^{7} ∑_{j=1}^{N} W_ij ‖, (8)

where N is the total number of research studies for a particular set of evaluation metrics, max(W_ij) ∈ [0, 1] is the relative normalized weight assigned to each literature study based on the number of experiments performed, and W_ij is the framework test-bed score calculated from the experimentation results of each study.
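Assuming the double bars in equation (8) denote rounding to the nearest star, the computation can be sketched as follows; all weights and test-bed scores are made-up placeholders, not the study's actual data:

```python
def star_ranking(scores, max_weights, s=5):
    """Scale weighted test-bed scores to a 0..s star ranking (equation (8)).

    scores[i][j]      -- test-bed score W_ij of study j under evaluation factor i
    max_weights[i][j] -- normalized study weight max(W_ij) in [0, 1]
    """
    total = sum(w for factor in scores for w in factor)
    total_max = sum(w for factor in max_weights for w in factor)
    return round(s * total / total_max)

# Two evaluation factors with two studies each (placeholder values).
scores      = [[0.8, 0.6], [0.9, 0.6]]
max_weights = [[1.0, 1.0], [1.0, 1.0]]

print(star_ranking(scores, max_weights))  # round(5 * 2.9 / 4.0) = 4
```

The normalization by the summed maximum weights keeps rankings comparable even when different frameworks are covered by different numbers of studies.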

Hadoop MapReduce has a clear edge for large-scale deployment and larger-dataset processing. Hadoop is highly compatible and interoperable with other frameworks. It also offers a reliable fault tolerance mechanism that provides failure-free operation over long periods of time, and it can operate on low-cost configurations. However, Hadoop is not suitable for real-time applications. It is at a significant disadvantage when latency, throughput, and iterative job support for machine learning are the key considerations of the application requirements.

Apache Spark is designed as a replacement for the batch-oriented Hadoop ecosystem, running over both static and real-time datasets. It is highly suitable for high-throughput streaming applications where latency is not a major issue. Spark is memory intensive, and all operations take place in memory. As a result, it may crash if enough memory is not available for further operations (before the release of Spark version 1.5, it was not capable of handling datasets larger than the size of RAM, and the problem of handling larger datasets still persists in the newer releases, with different performance overheads). A few research efforts, such as Project Tungsten, are aimed at improving the memory and CPU efficiency of Spark applications. Spark also lacks its own storage system, so its integration with HDFS through YARN, or with Cassandra using Mesos, is an extra overhead for cluster configuration.

Apache Flink is a true streaming engine. Flink supports both batch and real-time operations over a common runtime to fulfill the requirements of the Lambda architecture. However,


it may also work in batch mode by stopping the streaming source. Like Spark, Flink performs all operations in memory, but when memory is exhausted it may also use disk storage to avoid application failure. Flink has some major advantages over Hadoop and Spark, providing better support for iterative processing with high throughput and low latency.

Apache Storm was designed to provide a scalable, fault-tolerant, real-time streaming engine for data analysis, doing for real-time processing what Hadoop did for batch processing. However, the empirical evidence suggests that Apache Storm has proved inefficient in meeting the scale-up/scale-down requirements of real-time big data applications. Furthermore, since it uses micro-batch stream processing, it is not very efficient where continuous stream processing is a major concern, nor does it provide a mechanism for simple batch processing. For fault tolerance, Storm uses Zookeeper to store the state of the processes, which may involve some extra overhead and may also result in message loss. On the other hand, Storm is an ideal solution for near-real-time application processing, where the workload can be processed with minimal delay under strict latency requirements.

Apache Samza, in integration with Kafka, provides some unique features that are not offered by other stream processing engines. Samza provides a powerful checkpointing-based fault tolerance mechanism with minimal data loss. Samza jobs can achieve high throughput with low latency when integrated with Kafka. However, Samza lacks some important features as a data processing engine. Furthermore, it offers no built-in security mechanism for data access control.

To categorize the selection of the best resource engine based on a particular set of requirements, we use the framework proposed by Chung et al. [51]. The framework provides a layout for matching, ranking, and selecting a system based on particular requirements. The matching criterion is based on user goals, which are categorized as soft and hard goals. Soft goals represent the nonfunctional requirements of the system (such as security, fault tolerance, and scalability), while hard goals represent the functional aspects of the system (such as machine learning and data size support). The relationship between the system components and the goals can be ranked as very positive (++), positive (+), negative (−), and very negative (−−). Based on the evidence provided in the literature, the categorization of the major resource management frameworks is presented in Figure 7.

5. Related Work

Our research differs from other efforts because the subject, goal, and object of study are not identical: we provide an in-depth comparison of popular resource engines based on empirical evidence from the existing literature. Hesse and Lorenz [8] conducted a conceptual survey on stream processing systems. However, their discussion focused on some basic differences related to real-time data processing engines. Singh and Reddy [9] provided a thorough analysis of big data analytics platforms that included peer-to-peer networks, field programmable gate arrays (FPGAs), the Apache Hadoop ecosystem, high-performance computing (HPC)

clusters, multicore CPUs, and graphics processing units (GPUs). Our case is different, as we are particularly interested in big data processing engines. Finally, Landset et al. [10] focused on machine learning libraries and their evaluation based on ease of use, scalability, and extensibility. However, our work differs from theirs, as the primary focuses of the two studies are not identical.

Chen and Zhang [11] discussed big data problems, challenges, and the associated techniques and technologies to address these issues. Several potential techniques, including cloud computing, quantum computing, granular computing, and biological computing, were investigated, and the possible opportunities to explore these domains were demonstrated. However, the performance evaluation was discussed only on theoretical grounds. A taxonomy and detailed analysis of the state of the art in big data 2.0 processing systems were presented in [12]. The focus of the study was to identify current research challenges and highlight opportunities for new innovations and optimization for future research and development. Assuncao et al. [13] reviewed multiple generations of data stream processing frameworks that provide mechanisms for resource elasticity to match the demands of stream processing services. The study examined the challenges associated with efficient resource management decisions and suggested solutions derived from the existing research studies. However, the study metrics are restricted to the elasticity/scalability aspect of big data streaming frameworks.

As shown in Table 3, our work differs from the previous studies, which focused on the classification of resource management frameworks on theoretical grounds. In contrast to the earlier approaches, we classify and categorize big data resource management frameworks on empirical grounds derived from multiple evaluation/experimentation studies. Furthermore, our evaluation/ranking methodology is based on a comprehensive list of study variables which were not addressed in the studies conducted earlier.

6. Conclusions and Future Work

There are a number of disruptive and transformative big data technologies and solutions that are rapidly emerging and evolving in order to provide data-driven insight and innovation. The primary objective of the study was to classify popular big data resource management systems. This study was also aimed at addressing the selection of a candidate resource provider based on specific big data application requirements. We surveyed different big data resource management frameworks and investigated the advantages and disadvantages of each of them. We carried out the performance evaluation of resource management engines based on seven key factors, and each of the frameworks was ranked based on the empirical evidence from the literature.

6.1. Observations and Findings. Some key findings of the study are as follows:

(i) In terms of processing speed, Apache Flink outperforms other resource management frameworks for small, medium, and large datasets [30, 36]. However, during our own set of experiments on an Amazon EC2

Scientific Programming 13

Table 3: Comparison and application areas of related research studies.

Study [8]. Data model: data stream processing systems. Resource frameworks: Storm, Flink, Spark, and Samza. Study features: a brief comparison of resource frameworks. Evaluation/ranking methodology: N.

Study [9]. Data model: batch and stream processing systems. Resource frameworks: horizontal scaling systems such as peer-to-peer, MapReduce, MPI, and Spark, and vertical scaling systems such as CUDA and HDL. Study features: comparison of horizontal and vertical scaling systems. Evaluation/ranking methodology: theoretical comparison of resource frameworks.

Study [10]. Data model: batch and stream processing engines. Resource frameworks: MapReduce, Spark, Flink, and Storm, as well as machine learning libraries. Study features: machine learning libraries and their evaluation mechanisms. Evaluation/ranking methodology: performance comparison with respect to machine learning toolkits.

Study [11]. Data model: batch and stream processing frameworks. Resource frameworks: Hadoop, Storm, and other big data frameworks. Study features: in-depth analysis of big data opportunities and challenges. Evaluation/ranking methodology: N.

Study [12]. Data model: batch and stream processing frameworks. Resource frameworks: Hadoop, Spark, Storm, Flink, and Tez, as well as SQL, graph, and bulk synchronous parallel models. Study features: analysis of current open research challenges in the field of big data and the promising directions for future research. Evaluation/ranking methodology: N.

Study [13]. Data model: stream processing engines. Resource frameworks: Apache Storm, S4, Flink, Samza, Spark Streaming, and Twitter Heron. Study features: classification of elasticity metrics for resource allocation strategies that meet the demands of stream processing services. Evaluation/ranking methodology: evaluation of elasticity scaling metrics for stream processing systems.

Figure 7: Categorization of resource engines based on key big data application requirements. (The figure rates Hadoop, Spark, Flink, Storm, and Samza against soft goals, i.e., nonfunctional requirements: fault tolerance (recoverability), scalability, performance, security (access control and authentication), and latency; and hard goals, i.e., functional requirements: iterative machine learning support and supported data size; using the very positive (++) to very negative (−−) scale.)

cluster with varied task manager settings (1–4 task managers per node), Flink failed to complete custom smaller-size JVM dataset jobs due to the inefficient memory management of the Flink memory manager. We could not find any reported evidence of this particular use case in the relevant literature. It seems that most of the performance evaluation studies employed standard benchmarking test sets, where the dataset size was relatively large, and hence this particular use case was not reported in these studies. Further research effort is required to elucidate the specific underlying factors of this particular case.

(ii) Big data applications usually involve massive amounts of data. Apache Spark supports the necessary strategies for a fault tolerance mechanism, but it has been reported to crash on larger datasets. Even during our experimentations, Apache Spark version 1.6 (selected for compatibility with earlier research) crashed on several occasions when the dataset was larger than 500 GB. Although Spark has been ranked higher in terms of fault tolerance with the increase of data scale in several studies [38, 52], it has limitations in handling datasets typical of larger big data applications, and hence such studies cannot be generalized.

(iii) Spark MLlib and Flink-ML offer a variety of machine learning algorithms and utilities to exploit distributed and scalable big data applications. Spark MLlib outperforms Flink-ML in most of the machine learning use cases [42], except when repeated passes are performed on unchanged data. However, this performance evaluation may be investigated further, as research studies such as [53] reported differently, with Flink outperforming Spark on sufficiently large cases.

(iv) For graph processing algorithms such as PageRank, Flink uses the Gelly library, which provides native closed-loop iteration operators, making it a suitable platform for large-scale graph analytics. Spark, on the other hand, uses the GraphX library, which has a much longer preprocessing time to build the graph and other data structures, making its performance worse compared to Apache Flink. Apache Flink has been reported to obtain the best results for graph processing datasets (3x–5x in [54] and 2x–3x in [45]) as compared to Spark. However, some studies, such as [55], reported Spark to be 1.7x faster than Flink for large graph processing. Such inconsistent behavior may be investigated further in future research studies.
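As a practical aside to finding (ii): when a Spark working set may exceed cluster memory, a common mitigation is to persist intermediate data with a disk-backed storage level and to tune the unified memory manager introduced in Spark 1.6. The spark-defaults.conf fragment below is a hedged illustration; the values are generic starting points, not settings validated by the cited experiments:

```properties
# Illustrative spark-defaults.conf fragment for jobs whose working set
# may not fit in RAM; values are placeholders to be tuned per workload.
spark.memory.fraction          0.6
spark.memory.storageFraction   0.5
spark.serializer               org.apache.spark.serializer.KryoSerializer
```

In application code, persisting an RDD with StorageLevel.MEMORY_AND_DISK allows partitions that do not fit in memory to spill to disk rather than fail outright, at the cost of the performance overheads discussed above.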
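The workload behind the comparisons in (iv) is inherently iterative: PageRank recomputes every rank from its neighbors' ranks once per superstep, which is exactly the loop that Flink's native iteration operators keep inside the runtime. A minimal, framework-free sketch of the computation (plain Python, illustrative only):

```python
# Minimal PageRank by power iteration (plain Python, illustrative only).
# Each pass recomputes all ranks from the previous pass; dataflow engines
# without native iteration operators must reschedule or materialize the
# job once per pass, which is what Gelly's closed-loop operators avoid.

def pagerank(links, damping=0.85, iterations=50):
    """links maps each node to its list of outgoing neighbors."""
    nodes = list(links)
    ranks = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        contrib = {n: 0.0 for n in nodes}
        for src, outs in links.items():
            for dst in outs:                      # spread rank over out-links
                contrib[dst] += ranks[src] / len(outs)
        ranks = {n: (1 - damping) / len(nodes) + damping * contrib[n]
                 for n in nodes}
    return ranks

graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # "c": the only node with two in-links
```

In a distributed engine, the inner loop becomes a join between the rank and link datasets, so keeping that loop inside the runtime (rather than relaunching it per superstep) dominates the cost difference reported in the studies above.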

6.2. Future Work. In the area of big data, there is still a clear gap that requires more effort from the research community to build an in-depth understanding of the performance characteristics of big data resource management frameworks. We consider this study a step towards enlarging our knowledge of the big data world and an effort towards improving the state of the art and achieving the big vision in the big data domain. In earlier studies, a clear ranking could not be established, as the study parameters were mostly limited to a few issues such as throughput, latency, and machine learning. Furthermore, additional investigation is required on resource engines such as Apache Samza in comparison with other frameworks. In addition, research effort needs to be carried out in several areas, such as data organization, platform-specific tools, and technological issues in the big data domain, in order to create next-generation big data infrastructures. The performance evaluation factors might also vary among systems depending on the algorithms used. As future work, we plan to benchmark these popular resource engines for meeting the resource demands and requirements of different scientific applications. Moreover, a scalability analysis could be done; in particular, the performance evaluation of adding dynamic worker nodes and the resulting performance analysis is of special interest. Additionally, further research can be carried out to evaluate performance aspects with respect to resource competition between jobs (on different resource schedulers such as YARN and Mesos) and the fluctuation of available computing resources. Finally, most of the experimentations in earlier studies were performed using standard parameter configurations; however, each resource management framework offers domain-specific tweaks and configuration optimization mechanisms for meeting application-specific requirements. The development of a benchmark suite that aims to find maximum throughput based on configuration optimization would be an interesting direction for future research.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] "Big Data in the Cloud: Converging Technologies (August 11)," https://www.intel.com/content/dam/www/public/emea/de/de/documents/product-briefs/big-data-cloud-technologies-brief.pdf.

[2] Q. He, J. Yan, R. Kowalczyk, H. Jin, and Y. Yang, "Lifetime service level agreement management with autonomous agents for services provision," Information Sciences, vol. 179, no. 15, pp. 2591–2605, 2009.

[3] O. Rana, M. Warnier, T. B. Quillinan, and F. Brazier, "Monitoring and reputation mechanisms for service level agreements," in 5th International Workshop on Grid Economics and Business Models, GECON '08, vol. 5206 of Lecture Notes in Computer Science, pp. 125–139, 2008.

[4] N. Yuhanna and M. Gualtieri, "The Forrester Wave: Big Data Hadoop Cloud Solutions, Q2 2016: Elasticity, Automation, and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud (June 20)," https://ncmedia.azureedge.net/ncmedia/2016/05/The_Forrester_Wave_Big_D.pdf.

[5] R. Arora, "An introduction to big data, high performance computing, high-throughput computing, and Hadoop," in Conquering Big Data with High Performance Computing, pp. 1–12, 2016.

[6] F. Pop, J. Kołodziej, and B. D. Martino, Resource Management for Big Data Platforms: Algorithms, Modelling, and High-Performance Computing Techniques, Springer, 2016.

[7] M. Gribaudo, M. Iacono, and F. Palmieri, "Performance modeling of big data-oriented architectures," in Resource Management for Big Data Platforms, Computer Communications and Networks, pp. 3–34, Springer International Publishing, Cham, 2016.

[8] G. Hesse and M. Lorenz, "Conceptual survey on data stream processing systems," in Proceedings of the 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), pp. 797–802, Melbourne, Australia, December 2015.

[9] D. Singh and C. K. Reddy, "A survey on platforms for big data analytics," Journal of Big Data, vol. 2, Article ID 8, 2014.

[10] S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin, "A survey of open source tools for machine learning with big data in the Hadoop ecosystem," Journal of Big Data, vol. 2, no. 1, Article ID 24, 2015.

[11] C. L. P. Chen and C. Y. Zhang, "Data-intensive applications, challenges, techniques and technologies: a survey on Big Data," Information Sciences, vol. 275, pp. 314–347, 2014.

[12] F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, and S. Sakr, "Big data 2.0 processing systems: taxonomy and open challenges," Journal of Grid Computing, vol. 14, no. 3, pp. 379–405, 2016.

[13] M. D. d. Assuncao, A. d. S. Veith, and R. Buyya, "Distributed data stream processing and edge computing: a survey on resource elasticity and future directions," Journal of Network and Computer Applications, vol. 103, pp. 1–17, 2018.

[14] "Big data working group: big data taxonomy, 2014."

[15] W. Inoubli, S. Aridhi, H. Mezni, M. Maddouri, and E. M. Nguifo, "An experimental survey on big data frameworks," Future Generation Computer Systems, 2018.

[16] H.-K. Kwon and K.-K. Seo, "A fuzzy AHP based multi-criteria decision-making model to select a cloud service," International Journal of Smart Home, vol. 8, no. 3, pp. 175–180, 2014.

[17] A. Rezgui and S. Rezgui, "A stochastic approach for virtual machine placement in volunteer cloud federations," in Proceedings of the 2nd IEEE International Conference on Cloud Engineering, IC2E 2014, pp. 277–282, Boston, MA, USA, March 2014.

[18] M. Salama, A. Zeid, A. Shawish, and X. Jiang, "A novel QoS-based framework for cloud computing service provider selection," International Journal of Cloud Applications and Computing, vol. 4, no. 2, pp. 48–72, 2014.

[19] G. Mateescu, W. Gentzsch, and C. J. Ribbens, "Hybrid computing: where HPC meets grid and cloud computing," Future Generation Computer Systems, vol. 27, no. 5, pp. 440–453, 2011.

[20] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: enabling scalable virtual organizations," International Journal of High Performance Computing Applications, vol. 15, no. 3, pp. 200–222, 2001.

[21] S. Benedict, "Performance issues and performance analysis tools for HPC cloud applications: a survey," Computing, vol. 95, no. 2, pp. 89–108, 2013.

[22] E. C. Inacio and M. A. R. Dantas, "A survey into performance and energy efficiency in HPC, cloud and big data environments," International Journal of Networking and Virtual Organisations, vol. 14, no. 4, pp. 299–318, 2014.

[23] N. Yuhanna and M. Gualtieri, "Elasticity, Automation and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud, Q2 2016."

[24] S. Ullah, M. D. Awan, and M. S. Khiyal, "A price-performance analysis of EC2, Google Compute and Rackspace cloud providers for scientific computing," Journal of Mathematics and Computer Science, vol. 16, no. 2, pp. 178–192, 2016.

[25] J. Hurwitz, A. Nugent, F. Halper, and M. Kaufman, Big Data for Dummies, John Wiley & Sons, 2013.

[26] S. K. Garg, S. Versteeg, and R. Buyya, "A framework for ranking of cloud computing services," Future Generation Computer Systems, vol. 29, no. 4, pp. 1012–1023, 2013.

[27] D. Wu, S. Sakr, and L. Zhu, "Big data programming models," in Handbook of Big Data Technologies, pp. 31–63, 2017.

[28] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, vol. 1, no. 1, 2016.

[29] A. Li, X. Yang, S. Kandula, and M. Zhang, "CloudCmp: comparing public cloud providers," in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10), pp. 1–14, ACM, Melbourne, Australia, November 2010.

[30] J. Veiga, R. R. Exposito, X. C. Pardo, G. L. Taboada, and J. Touriño, "Performance evaluation of big data frameworks for large-scale data analytics," in Proceedings of the 4th IEEE International Conference on Big Data, Big Data 2016, pp. 424–431, December 2016.

[31] I. Mavridis and H. Karatza, "Log file analysis in cloud with Apache Hadoop and Apache Spark," in Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015), Poland, 2015.

[32] M. Zaharia, M. Chowdhury, and T. Das, "Fast and interactive analytics over Hadoop data with Spark," USENIX Login, vol. 37, no. 4, pp. 45–51, 2012.

[33] S. Vellaipandiyan and P. V. Raja, "Performance evaluation of distributed framework over YARN cluster manager," in Proceedings of the 2016 IEEE International Conference on Computational Intelligence and Computing Research, ICCIC 2016, December 2016.

[34] V. Taran, O. Alienin, S. Stirenko, Y. Gordienko, and A. Rojbi, "Performance evaluation of distributed computing environments with Hadoop and Spark frameworks," in Proceedings of the 2017 IEEE International Young Scientists' Forum on Applied Physics and Engineering (YSF), pp. 80–83, Lviv, Ukraine, October 2017.

[35] S. Gopalani and R. Arora, "Comparing Apache Spark and Map Reduce with performance analysis using K-means," International Journal of Computer Applications, vol. 113, no. 1, pp. 8–11, 2015.

[36] M. Bertoni, S. Ceri, A. Kaitoua, and P. Pinoli, "Evaluating cloud frameworks on genomic applications," in Proceedings of the 3rd IEEE International Conference on Big Data, IEEE Big Data 2015, pp. 193–202, November 2015.

[37] S. Hukerikar, R. A. Ashraf, and C. Engelmann, "Towards new metrics for high-performance computing resilience," in Proceedings of the 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017, pp. 23–30, June 2017.

[38] R. Lu, G. Wu, B. Xie, and J. Hu, "StreamBench: towards benchmarking modern distributed stream computing frameworks," in Proceedings of the 7th IEEE/ACM International Conference on Utility and Cloud Computing, UCC 2014, pp. 69–78, December 2014.

[39] L. Gu and H. Li, "Memory or time: performance evaluation for iterative operation on Hadoop and Spark," in Proceedings of the 15th IEEE International Conference on High Performance Computing and Communications, HPCC 2013, and 11th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, EUC 2013, pp. 721–727, November 2013.

[40] M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte, "A performance comparison of open-source stream processing platforms," in Proceedings of the 59th IEEE Global Communications Conference, GLOBECOM 2016, December 2016.

[41] J. L. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, vol. 31, no. 5, pp. 532–533, 1988.

[42] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, "A comparison on scalability for batch big data processing on Apache Spark and Apache Flink," Big Data Analytics, vol. 2, no. 1, 2017.

[43] P. Jakovits and S. N. Srirama, "Evaluating MapReduce frameworks for iterative scientific computing applications," in Proceedings of the 2014 International Conference on High Performance Computing and Simulation, HPCS 2014, pp. 226–233, July 2014.

[44] C. Boden, A. Spina, T. Rabl, and V. Markl, "Benchmarking data flow systems for scalable machine learning," in Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2017, May 2017.

[45] N. Spangenberg, M. Roth, and B. Franczyk, Evaluating New Approaches of Big Data Analytics Frameworks, vol. 208 of Lecture Notes in Business Information Processing, Springer, Cham, 2015.

[46] J. Shi, Y. Qiu, U. F. Minhas et al., "Clash of the titans: MapReduce vs. Spark for large scale data analytics," in Proceedings of the 41st International Conference on Very Large Data Bases, pp. 2110–2121, Kohala Coast, HI, USA, 2015.

[47] M. Kang and J. Lee, "A comparative analysis of iterative MapReduce systems," in Proceedings of the Sixth International Conference, pp. 61–64, Jeju, Republic of Korea, October 2016.

[48] H. Lee, M. Kang, S.-B. Youn, J.-G. Lee, and Y. Kwon, "An experimental comparison of iterative MapReduce frameworks," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, pp. 2089–2094, October 2016.

[49] S. Chintapalli, D. Dagit, B. Evans et al., "Benchmarking streaming computation engines: Storm, Flink and Spark streaming," in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016, pp. 1789–1792, May 2016.

[50] X. Zhang, C. Liu, S. Nepal, W. Dou, and J. Chen, "Privacy-preserving layer over MapReduce on cloud," in Proceedings of the 2nd International Conference on Cloud and Green Computing, CGC 2012, held jointly with the 2nd International Conference on Social Computing and Its Applications, SCA 2012, pp. 304–310, November 2012.

[51] L. Chung, B. A. Nixon, and E. Yu, "Using nonfunctional requirements to systematically select among alternatives in architectural design," in Proceedings of the 1st International Workshop on Architectures for Software Systems, pp. 31–43, 1995.

[52] S. Qian, G. Wu, J. Huang, and T. Das, "Benchmarking modern distributed streaming platforms," in Proceedings of the 2016 IEEE International Conference on Industrial Technology (ICIT), pp. 592–598, Taipei, Taiwan, March 2016.

[53] F. Dambreville and A. Toumi, "Generic and massively concurrent computation of belief combination rules," in Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, pp. 1–6, Blagoevgrad, Bulgaria, November 2016.

[54] R. W. Techentin, M. W. Markland, R. J. Poole, D. R. H., C. R. Haider, and B. K. Gilbert, "PageRank performance evaluation of cluster computing frameworks on Cray Urika-GX supercomputer," Computer Science and Engineering, vol. 6, pp. 33–38, 2016.

[55] O.-C. Marcu, A. Costan, G. Antoniu, and M. S. Perez-Hernandez, "Spark versus Flink: understanding performance in big data analytics frameworks," in Proceedings of the 2016 IEEE International Conference on Cluster Computing, CLUSTER 2016, pp. 433–442, Taipei, Taiwan, September 2016.


Page 11: Big Data in Cloud Computing: A Resource Management Perspectivedownloads.hindawi.com/journals/sp/2018/5418679.pdf · 2019-07-30 · ReviewArticle Big Data in Cloud Computing: A Resource

Scientific Programming 11

Let119879elapsed be the elapsed time between the start andfinishtime of a program in a distributed architecture 119879119894 be theeffective execution time and 120582119894 be the sum of total idle unitsof 119894th processor from a set of119873 processorsThen the averagelatency represented as 120582(119882119873) for the size of workload119882is defined as the average amount of overhead time needed foreach processor to complete the task

120582 (119882119873) =sum119873119894=1 (119879elapsed minus 119879119894 + 120582119894)

119873 (6)

Chintapalli et al [49] conducted a detailed analysis ofApache Storm Flink and Spark streaming engines for latencyand throughput The study results indicated that for highthroughput Flink and Storm have significantly lower latencyas compared to Spark However Spark was able to handlehigh throughput as compared to other streaming engines Luet al [38] proposed StreamBench benchmark framework toevaluate modern distributed stream processing frameworksThe framework includes dataset selection data generationmethodologies program set description workload suitesdesign and metric proposition Two real-world datasetsAOL Search Data and CAIDA Anonymized Internet TracesDataset were used to assess performance scalability andfault tolerance aspects of streaming frameworks It wasobserved that Stormrsquos latency in most cases was far less thanSparkrsquos except in the case when the scales of the workload anddataset were massive in nature

36 Security Instead of the classical HPC environmentwhere information is stored in-house many big data appli-cations are now increasingly deployed on the cloud whereprivacy-sensitive information may be accessed or recordedby different data users with ease Although data privacyand security issues are not a new topic in the area of dis-tributed computing their importance is amplified by thewideadoption of cloud computing services for big data platformThe dataset may be exposed to multiple users for differentpurposes which may lead to security and privacy risks

Let119873 be a list of security categories that may be providedfor security mechanism For instance a framework may useencryption mechanism to provide data security and accesscontrol list for authentication and authorization services Letmax(119882119894) be the maximum weight that is assigned to the119894th security category from a list of 119873 categories and 119882119894 bethe reputation score of a particular resource managementframework Then the framework security ranking score canbe represented as

Secscore =sum119873119894=1119882119894sum119873119894=1max (119882119894)

(7)

Hadoop and Storm use Kerberos authentication protocolfor computing nodes to provide their identity [50] Sparkadopts a password based shared secret configuration as wellas Access Control Lists (ACLs) to control the authenticationand authorization mechanisms In Flink stream brokers areresponsible for providing authentication mechanism acrossmultiple services Apache SamzaKafka provides no built-insecurity at the system level

37 Dataset Size Support Many scientific applications scaleup to hundreds of nodes to process massive amount of datathat may exceed over hundreds of terabytes Unless big dataapplications are properly optimized for larger datasets thismay result in performance degradationwith the growth of thedata Furthermore in many cases this may result in a crashof software resulting in loss of time and money We usedthe same methodology to collect big data support statisticsas presented in earlier sections

4 Discussion on Big Data Framework

Every big data framework has been evolved for its uniquedesign characteristics and application-specific requirementsBased on the key factors and their empirical evidence toevaluate big data frameworks as discussed in Section 3 weproduce the summary of results of seven evaluation factorsin the form of star ranking RankingRF isin [0 119904] where 119904 rep-resents the maximum evaluation score The general equationto produce the ranking for each resource framework (RF) isgiven as

RankingRF =10038171003817100381710038171003817100381710038171003817100381710038171003817

119904sum7119894=1sum119873119895=1max (119882119894119895)

lowast7

sum119894=1

119873

sum119895=1

11988211989411989510038171003817100381710038171003817100381710038171003817100381710038171003817 (8)

where 119873 is the total number of research studies for aparticular set of evaluation metrics max(119882119894119895) isin [0 1] isthe relative normalized weight assigned to each literaturestudy based on the number of experiments performed and119882119894119895 is the framework test-bed score calculated from theexperimentation results from each study

Hadoop MapReduce has a clear edge on large-scaledeployment and larger dataset processing Hadoop is highlycompatible and interoperable with other frameworks It alsooffers a reliable fault tolerance mechanism to provide afailure-free mechanism over a long period of time Hadoopcan operate on a low-cost configuration However Hadoopis not suitable for real-time applications It has a significantdisadvantage when latency throughput and iterative jobsupport for machine learning are the key considerations ofapplication requirements

Apache Spark is designed to be a replacement for batch-oriented Hadoop ecosystem to run-over static and real-timedatasets It is highly suitable for high throughput streamingapplications where latency is not a major issue Spark ismemory intensive and all operations take place in memoryAs a result it may crash if enough memory is not availablefor further operations (before the release of Spark version 15it was not capable of handling datasets larger than the size ofRAM and the problem of handling larger dataset still persistsin the newer releases with different performance overheads)Few research efforts such as Project Tungsten are aimedat addressing the efficiency of memory and CPU for Sparkapplications Spark also lacks its own storage system so itsintegration with HDFS through YARN or Cassandra usingMesos is an extra overhead for cluster configuration

Apache Flink is a true streaming engine Flink supportsboth batch and real-time operations over a common runtimeto fulfill the requirements of Lambda architecture However

12 Scientific Programming

it may also work in batch mode by stopping the streamingsource Like Spark Flink performs all operations in memorybut in case of memory hog it may also use disk storage toavoid application failure Flink has some major advantagesover Hadoop and Spark by providing better support foriterative processing with high throughput at the cost of lowlatency

Apache Storm was designed to provide a scalable faulttolerance real-time streaming engine for data analysis whichHadoop did for batch processing However the empiricalevidence suggests that Apache Storm proved to be inefficientto meet the scale-upscale-down requirements for real-timebig data applications Furthermore since it uses microbathstream processing it is not very efficient where continuousstream process is a major concern nor does it provide amechanism for simple batch processing For fault toleranceStormuses Zookeeper to store the state of the processeswhichmay involve some extra overhead and may also result inmessage loss On the other hand Storm is an ideal solutionfor near-real-time application processing where workloadcould be processed with a minimal delay with strict latencyrequirements

Apache Samza in integration Kafka provides someunique features that are not offered by other stream pro-cessing engines Samza provides a powerful check-pointingbased fault tolerance mechanism with minimal data lossSamza jobs can have high throughput with low latency whenintegrated with Kafka However Samza lacks some importantfeatures as data processing engine Furthermore it offers nobuilt-in security mechanism for data access control

To categorize the selection of best resource engine basedon a particular set of requirements we use the frameworkproposed by Chung et al [51] The framework provides alayout for matching ranking and selecting a system basedon some particular requirements The matching criterion isbased on user goals which are categorized as soft and hardgoals Soft goals represent the nonfunctional requirements ofthe system (such as security fault tolerance and scalability)while hard goals represent the functional aspects of thesystem (such as machine learning and data size support)Therelationship between the system components and the goalscan be ranked as very positive (++) positive (+) negative(minus) and very negative (minusminus) Based on the evidence providedfrom the literature the categorization of major resourcemanagement frameworks is presented in Figure 7

5 Related Work

Our research work differs from other efforts because the subject, goal, and object of study are not identical: we provide an in-depth comparison of popular resource engines based on empirical evidence from the existing literature. Hesse and Lorenz [8] conducted a conceptual survey on stream processing systems; however, their discussion was focused on some basic differences related to real-time data processing engines. Singh and Reddy [9] provided a thorough analysis of big data analytic platforms that included peer-to-peer networks, field programmable gate arrays (FPGA), the Apache Hadoop ecosystem, high-performance computing (HPC) clusters, multicore CPUs, and graphics processing units (GPU). Our case is different here, as we are particularly interested in big data processing engines. Finally, Landset et al. [10] focused on machine learning libraries and their evaluation based on ease of use, scalability, and extensibility; however, our work differs from theirs, as the primary focuses of the two studies are not identical.

Chen and Zhang [11] discussed big data problems and challenges and the associated techniques and technologies to address these issues. Several potential techniques, including cloud computing, quantum computing, granular computing, and biological computing, were investigated, and the possible opportunities to explore these domains were demonstrated. However, the performance evaluation was discussed only on theoretical grounds. A taxonomy and detailed analysis of the state of the art in big data 2.0 processing systems were presented in [12]; the focus of that study was to identify current research challenges and highlight opportunities for new innovations and optimization in future research and development. Assuncao et al. [13] reviewed multiple generations of data stream processing frameworks that provide mechanisms for resource elasticity to match the demands of stream processing services. The study examined the challenges associated with efficient resource management decisions and suggested solutions derived from the existing research studies. However, the study metrics are restricted to the elasticity/scalability aspect of big data streaming frameworks.

As shown in Table 3, our work differs from the previous studies, which focused on the classification of resource management frameworks on theoretical grounds. In contrast to the earlier approaches, we classify and categorize big data resource management frameworks on empirical grounds, derived from multiple evaluation/experimentation studies. Furthermore, our evaluation/ranking methodology is based on a comprehensive list of study variables which were not addressed in the earlier studies.

6. Conclusions and Future Work

There are a number of disruptive and transformative big data technologies and solutions that are rapidly emerging and evolving in order to provide data-driven insight and innovation. The primary objective of the study was how to classify popular big data resource management systems. This study was also aimed at addressing the selection of a candidate resource provider based on specific big data application requirements. We surveyed different big data resource management frameworks and investigated the advantages and disadvantages of each of them. We carried out the performance evaluation of resource management engines based on seven key factors, and each of the frameworks was ranked based on the empirical evidence from the literature.

6.1. Observations and Findings. Some key findings of the study are as follows:

(i) In terms of processing speed, Apache Flink outperforms the other resource management frameworks for small, medium, and large datasets [30, 36]. However, during our own set of experiments on an Amazon EC2 cluster with varied task manager settings (1–4 task managers per node), Flink failed to complete custom smaller-size JVM dataset jobs due to inefficient memory management by the Flink memory manager. We could not find any reported evidence of this particular use case in the relevant literature; it seems that most performance evaluation studies employed standard benchmarking test sets, where the dataset size was relatively large, and hence this particular use case was not reported. Further research effort is required to elucidate the specific underlying factors in this case.

Scientific Programming 13

Table 3: Comparison and application areas of related research studies (columns: study reference; data model; resource frameworks; study features; evaluation/ranking methodology).

[8]: Data stream processing systems; Storm, Flink, Spark, and Samza; a brief comparison of resource frameworks; N.

[9]: Batch and stream processing systems; horizontal scaling systems such as peer-to-peer, MapReduce, MPI, and Spark, and vertical scaling systems such as CUDA and HDL; comparison of horizontal and vertical scaling systems; theoretical comparison of resource frameworks.

[10]: Batch and stream processing engines; MapReduce, Spark, Flink, and Storm, as well as machine learning libraries; machine learning libraries and their evaluation mechanism; performance comparison with respect to machine learning toolkits.

[11]: Batch and stream processing frameworks; Hadoop, Storm, and other big data frameworks; in-depth analysis of big data opportunities and challenges; N.

[12]: Batch and stream processing frameworks; Hadoop, Spark, Storm, Flink, and Tez, as well as SQL, graph, and bulk synchronous parallel models; analysis of current open research challenges in the field of big data and the promising directions for future research; N.

[13]: Stream processing engines; Apache Storm, S4, Flink, Samza, Spark Streaming, and Twitter Heron; classification of elasticity metrics for resource allocation strategies that meet the demands of stream processing services; evaluation of elasticity scaling metrics for stream processing systems.

[Figure 7: Categorization of resource engines based on key big data application requirements. The goal graph relates Hadoop, Spark, Flink, Storm, and Samza, via AND decomposition, to soft goals, i.e., nonfunctional requirements (fault tolerance/recoverability, scalability, security/access control and authentication, latency), and hard goals, i.e., functional requirements (machine learning/iterative support, data size, performance), with ratings from very positive (++) to very negative (−−).]

(ii) Big data applications usually involve massive amounts of data. Apache Spark supports the necessary strategies for fault tolerance, but it has been reported to crash on larger datasets. Even during our experimentation, Apache Spark version 1.6 (selected for compatibility with earlier research) crashed on several occasions when the dataset was larger than 500 GB. Although Spark has been ranked higher in terms of fault tolerance with the increase of data scale in several studies [38, 52], it has limitations in handling larger, typical big data applications, and hence such studies cannot be generalized.

(iii) Spark MLlib and Flink-ML offer a variety of machine learning algorithms and utilities for distributed and scalable big data applications. Spark MLlib outperforms Flink-ML in most of the machine learning use cases [42], except when repeated passes are performed on unchanged data. However, this performance evaluation may be investigated further, as research studies such as [53] reported differently, with Flink outperforming Spark on sufficiently large cases.

(iv) For graph processing algorithms such as PageRank, Flink uses the Gelly library, which provides native closed-loop iteration operators, making it a suitable platform for large-scale graph analytics. Spark, on the other hand, uses the GraphX library, which has a much longer preprocessing time to build the graph and other data structures, making its performance worse compared to Apache Flink. Apache Flink has been reported to obtain the best results for graph processing datasets (3x–5x in [54] and 2x–3x in [45]) as compared to Spark. However, some studies, such as [55], reported Spark to be 1.7x faster than Flink for large graph processing. Such inconsistent behavior may be investigated further in future research studies.
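The iterative structure that closed-loop iteration operators optimize is visible even in a toy PageRank (a plain-Python sketch of the textbook power-iteration algorithm, unrelated to the Gelly or GraphX APIs): the same rank-update step is applied repeatedly over the whole graph, so an engine that keeps the loop state in place avoids rebuilding data structures and rescheduling work on every pass.

```python
# Minimal power-iteration PageRank over an adjacency list.
# Pure Python, for illustration only; real engines distribute this loop.

def pagerank(graph, damping=0.85, iterations=50):
    n = len(graph)
    ranks = {node: 1.0 / n for node in graph}
    for _ in range(iterations):          # the "closed loop" an engine optimizes
        new_ranks = {node: (1.0 - damping) / n for node in graph}
        for node, neighbors in graph.items():
            share = damping * ranks[node] / len(neighbors)
            for neighbor in neighbors:
                new_ranks[neighbor] += share
        ranks = new_ranks
    return ranks

# A tiny 4-node graph; every node links somewhere, so the ranks sum to 1.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(max(ranks, key=ranks.get))  # "c" receives the most incoming links
```

Each of the 50 passes touches every edge, which is why preprocessing cost and per-iteration scheduling overhead, rather than raw arithmetic, tend to dominate the reported differences between graph engines.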

6.2. Future Work. In the area of big data, there is still a clear gap that requires more effort from the research community to build an in-depth understanding of the performance characteristics of big data resource management frameworks. We consider this study a step towards enlarging our knowledge of the big data world and an effort towards improving the state of the art and achieving the big vision in the big data domain. In earlier studies, a clear ranking could not be established, as the study parameters were mostly limited to a few issues such as throughput, latency, and machine learning. Further investigation is required on resource engines such as Apache Samza in comparison with other frameworks. In addition, research effort needs to be carried out in several areas, such as data organization, platform-specific tools, and technological issues in the big data domain, in order to create next-generation big data infrastructures. The performance evaluation factors might also vary among systems depending on the algorithms used. As future work, we plan to benchmark these popular resource engines for meeting the resource demands and requirements of different scientific applications. Moreover, a scalability analysis could be done; in particular, the performance evaluation of dynamically adding worker nodes and the resulting performance analysis is of particular interest. Additionally, further research can be carried out to evaluate performance aspects with respect to resource competition between jobs (on different resource schedulers such as YARN and Mesos) and the fluctuation of available computing resources. Finally, most of the experimentation in earlier studies was performed using standard parameter configurations; however, each resource management framework offers domain-specific tweaks and configuration optimization mechanisms for meeting application-specific requirements. The development of a benchmark suite that aims to find maximum throughput based on configuration optimization would be an interesting direction of future research.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] "Big Data in the Cloud: Converging Technologies (August 11)," https://www.intel.com/content/dam/www/public/emea/de/de/documents/product-briefs/big-data-cloud-technologies-brief.pdf.

[2] Q. He, J. Yan, R. Kowalczyk, H. Jin, and Y. Yang, "Lifetime service level agreement management with autonomous agents for services provision," Information Sciences, vol. 179, no. 15, pp. 2591–2605, 2009.

[3] O. Rana, M. Warnier, T. B. Quillinan, and F. Brazier, "Monitoring and reputation mechanisms for service level agreements," in 5th International Workshop on Grid Economics and Business Models, GECON '08, vol. 5206 of Lecture Notes in Computer Science, pp. 125–139, 2008.

[4] N. Yuhanna and M. Gualtieri, "The Forrester Wave: Big Data Hadoop Cloud Solutions, Q2 2016: Elasticity, Automation, and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud (June 20)," https://ncmedia.azureedge.net/ncmedia/2016/05/The Forrester Wave Big D.pdf.

[5] R. Arora, "An introduction to big data, high performance computing, high-throughput computing, and Hadoop," in Conquering Big Data with High Performance Computing, pp. 1–12, 2016.

[6] F. Pop, J. Kołodziej, and B. D. Martino, Resource Management for Big Data Platforms: Algorithms, Modelling, and High-Performance Computing Techniques, Springer, 2016.

[7] M. Gribaudo, M. Iacono, and F. Palmieri, "Performance modeling of big data-oriented architectures," in Resource Management for Big Data Platforms, Computer Communications and Networks, pp. 3–34, Springer International Publishing, Cham, 2016.

[8] G. Hesse and M. Lorenz, "Conceptual survey on data stream processing systems," in Proceedings of the 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), pp. 797–802, Melbourne, Australia, December 2015.

[9] D. Singh and C. K. Reddy, "A survey on platforms for big data analytics," Journal of Big Data, vol. 2, Article ID 8, 2014.

[10] S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin, "A survey of open source tools for machine learning with big data in the Hadoop ecosystem," Journal of Big Data, vol. 2, no. 1, Article ID 24, 2015.

[11] C. L. P. Chen and C. Y. Zhang, "Data-intensive applications, challenges, techniques and technologies: a survey on Big Data," Information Sciences, vol. 275, pp. 314–347, 2014.

[12] F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, and S. Sakr, "Big data 2.0 processing systems: taxonomy and open challenges," Journal of Grid Computing, vol. 14, no. 3, pp. 379–405, 2016.

[13] M. D. d. Assuncao, A. d. S. Veith, and R. Buyya, "Distributed data stream processing and edge computing: a survey on resource elasticity and future directions," Journal of Network and Computer Applications, vol. 103, pp. 1–17, 2018.

[14] "Big data working group: big data taxonomy, 2014."

[15] W. Inoubli, S. Aridhi, H. Mezni, M. Maddouri, and E. M. Nguifo, "An experimental survey on big data frameworks," Future Generation Computer Systems, 2018.

[16] H.-K. Kwon and K.-K. Seo, "A fuzzy AHP based multi-criteria decision-making model to select a cloud service," International Journal of Smart Home, vol. 8, no. 3, pp. 175–180, 2014.

[17] A. Rezgui and S. Rezgui, "A stochastic approach for virtual machine placement in volunteer cloud federations," in Proceedings of the 2nd IEEE International Conference on Cloud Engineering, IC2E 2014, pp. 277–282, Boston, MA, USA, March 2014.

[18] M. Salama, A. Zeid, A. Shawish, and X. Jiang, "A novel QoS-based framework for cloud computing service provider selection," International Journal of Cloud Applications and Computing, vol. 4, no. 2, pp. 48–72, 2014.

[19] G. Mateescu, W. Gentzsch, and C. J. Ribbens, "Hybrid computing: where HPC meets grid and cloud computing," Future Generation Computer Systems, vol. 27, no. 5, pp. 440–453, 2011.

[20] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: enabling scalable virtual organizations," International Journal of High Performance Computing Applications, vol. 15, no. 3, pp. 200–222, 2001.

[21] S. Benedict, "Performance issues and performance analysis tools for HPC cloud applications: a survey," Computing, vol. 95, no. 2, pp. 89–108, 2013.

[22] E. C. Inacio and M. A. R. Dantas, "A survey into performance and energy efficiency in HPC, cloud and big data environments," International Journal of Networking and Virtual Organisations, vol. 14, no. 4, pp. 299–318, 2014.

[23] N. Yuhanna and M. Gualtieri, "Elasticity, Automation and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud, Q2 2016."

[24] S. Ullah, M. D. Awan, and M. S. Khiyal, "A price-performance analysis of EC2, Google Compute and Rackspace cloud providers for scientific computing," Journal of Mathematics and Computer Science, vol. 16, no. 2, pp. 178–192, 2016.

[25] J. Hurwitz, A. Nugent, F. Halper, and M. Kaufman, Big Data for Dummies, John Wiley & Sons, 2013.

[26] S. K. Garg, S. Versteeg, and R. Buyya, "A framework for ranking of cloud computing services," Future Generation Computer Systems, vol. 29, no. 4, pp. 1012–1023, 2013.

[27] D. Wu, S. Sakr, and L. Zhu, "Big data programming models," in Handbook of Big Data Technologies, pp. 31–63, 2017.

[28] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, vol. 1, no. 1, 2016.

[29] A. Li, X. Yang, S. Kandula, and M. Zhang, "CloudCmp: comparing public cloud providers," in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10), pp. 1–14, ACM, Melbourne, Australia, November 2010.

[30] J. Veiga, R. R. Exposito, X. C. Pardo, G. L. Taboada, and J. Touriño, "Performance evaluation of big data frameworks for large-scale data analytics," in Proceedings of the 4th IEEE International Conference on Big Data, Big Data 2016, pp. 424–431, December 2016.

[31] I. Mavridis and H. Karatza, "Log file analysis in cloud with Apache Hadoop and Apache Spark," in Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015), Poland, 2015.

[32] M. Zaharia, M. Chowdhury, and T. Das, "Fast and interactive analytics over Hadoop data with Spark," USENIX Login, vol. 37, no. 4, pp. 45–51, 2012.

[33] S. Vellaipandiyan and P. V. Raja, "Performance evaluation of distributed framework over YARN cluster manager," in Proceedings of the 2016 IEEE International Conference on Computational Intelligence and Computing Research, ICCIC 2016, December 2016.

[34] V. Taran, O. Alienin, S. Stirenko, Y. Gordienko, and A. Rojbi, "Performance evaluation of distributed computing environments with Hadoop and Spark frameworks," in Proceedings of the 2017 IEEE International Young Scientists' Forum on Applied Physics and Engineering (YSF), pp. 80–83, Lviv, Ukraine, October 2017.

[35] S. Gopalani and R. Arora, "Comparing Apache Spark and Map Reduce with performance analysis using K-means," International Journal of Computer Applications, vol. 113, no. 1, pp. 8–11, 2015.

[36] M. Bertoni, S. Ceri, A. Kaitoua, and P. Pinoli, "Evaluating cloud frameworks on genomic applications," in Proceedings of the 3rd IEEE International Conference on Big Data, IEEE Big Data 2015, pp. 193–202, November 2015.

[37] S. Hukerikar, R. A. Ashraf, and C. Engelmann, "Towards new metrics for high-performance computing resilience," in Proceedings of the 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017, pp. 23–30, June 2017.

[38] R. Lu, G. Wu, B. Xie, and J. Hu, "Stream bench: towards benchmarking modern distributed stream computing frameworks," in Proceedings of the 7th IEEE/ACM International Conference on Utility and Cloud Computing, UCC 2014, pp. 69–78, December 2014.

[39] L. Gu and H. Li, "Memory or time: performance evaluation for iterative operation on Hadoop and Spark," in Proceedings of the 15th IEEE International Conference on High Performance Computing and Communications, HPCC 2013, and 11th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, EUC 2013, pp. 721–727, November 2013.

[40] M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte, "A performance comparison of open-source stream processing platforms," in Proceedings of the 59th IEEE Global Communications Conference, GLOBECOM 2016, December 2016.

[41] J. L. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, vol. 31, no. 5, pp. 532–533, 1988.

[42] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, "A comparison on scalability for batch big data processing on Apache Spark and Apache Flink," Big Data Analytics, vol. 2, no. 1, 2017.

[43] P. Jakovits and S. N. Srirama, "Evaluating MapReduce frameworks for iterative scientific computing applications," in Proceedings of the 2014 International Conference on High Performance Computing and Simulation, HPCS 2014, pp. 226–233, July 2014.

[44] C. Boden, A. Spina, T. Rabl, and V. Markl, "Benchmarking data flow systems for scalable machine learning," in Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2017, May 2017.

[45] N. Spangenberg, M. Roth, and B. Franczyk, Evaluating New Approaches of Big Data Analytics Frameworks, vol. 208 of Lecture Notes in Business Information Processing, Springer, Cham, 2015.

[46] J. Shi, Y. Qiu, U. F. Minhas et al., "Clash of the titans: MapReduce vs. Spark for large scale data analytics," in Proceedings of the 41st International Conference on Very Large Data Bases, pp. 2110–2121, Kohala Coast, HI, USA, 2015.

[47] M. Kang and J. Lee, "A comparative analysis of iterative MapReduce systems," in Proceedings of the Sixth International Conference, pp. 61–64, Jeju, Republic of Korea, October 2016.

[48] H. Lee, M. Kang, S.-B. Youn, J.-G. Lee, and Y. Kwon, "An experimental comparison of iterative MapReduce frameworks," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, pp. 2089–2094, October 2016.

[49] S. Chintapalli, D. Dagit, B. Evans et al., "Benchmarking streaming computation engines: Storm, Flink and Spark streaming," in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016, pp. 1789–1792, May 2016.

[50] X. Zhang, C. Liu, S. Nepal, W. Dou, and J. Chen, "Privacy-preserving layer over MapReduce on cloud," in Proceedings of the 2nd International Conference on Cloud and Green Computing, CGC 2012, held jointly with the 2nd International Conference on Social Computing and Its Applications, SCA 2012, pp. 304–310, November 2012.

[51] L. Chung, B. A. Nixon, and E. Yu, "Using nonfunctional requirements to systematically select among alternatives in architectural design," in Proceedings of the 1st International Workshop on Architectures for Software Systems, pp. 31–43, 1995.

[52] S. Qian, G. Wu, J. Huang, and T. Das, "Benchmarking modern distributed streaming platforms," in Proceedings of the 2016 IEEE International Conference on Industrial Technology (ICIT), pp. 592–598, Taipei, Taiwan, March 2016.

[53] F. Dambreville and A. Toumi, "Generic and massively concurrent computation of belief combination rules," in Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, pp. 1–6, Blagoevgrad, Bulgaria, November 2016.

[54] R. W. Techentin, M. W. Markland, R. J. Poole, D. R. Holmes, C. R. Haider, and B. K. Gilbert, "Page rank performance evaluation of cluster computing frameworks on Cray Urika-GX supercomputer," Computer Science and Engineering, vol. 6, pp. 33–38, 2016.

[55] O.-C. Marcu, A. Costan, G. Antoniu, and M. S. Perez-Hernandez, "Spark versus Flink: understanding performance in big data analytics frameworks," in Proceedings of the 2016 IEEE International Conference on Cluster Computing, CLUSTER 2016, pp. 433–442, Taipei, Taiwan, September 2016.


Page 12: Big Data in Cloud Computing: A Resource Management Perspectivedownloads.hindawi.com/journals/sp/2018/5418679.pdf · 2019-07-30 · ReviewArticle Big Data in Cloud Computing: A Resource

12 Scientific Programming

it may also work in batch mode by stopping the streamingsource Like Spark Flink performs all operations in memorybut in case of memory hog it may also use disk storage toavoid application failure Flink has some major advantagesover Hadoop and Spark by providing better support foriterative processing with high throughput at the cost of lowlatency

Apache Storm was designed to provide a scalable faulttolerance real-time streaming engine for data analysis whichHadoop did for batch processing However the empiricalevidence suggests that Apache Storm proved to be inefficientto meet the scale-upscale-down requirements for real-timebig data applications Furthermore since it uses microbathstream processing it is not very efficient where continuousstream process is a major concern nor does it provide amechanism for simple batch processing For fault toleranceStormuses Zookeeper to store the state of the processeswhichmay involve some extra overhead and may also result inmessage loss On the other hand Storm is an ideal solutionfor near-real-time application processing where workloadcould be processed with a minimal delay with strict latencyrequirements

Apache Samza in integration Kafka provides someunique features that are not offered by other stream pro-cessing engines Samza provides a powerful check-pointingbased fault tolerance mechanism with minimal data lossSamza jobs can have high throughput with low latency whenintegrated with Kafka However Samza lacks some importantfeatures as data processing engine Furthermore it offers nobuilt-in security mechanism for data access control

To categorize the selection of best resource engine basedon a particular set of requirements we use the frameworkproposed by Chung et al [51] The framework provides alayout for matching ranking and selecting a system basedon some particular requirements The matching criterion isbased on user goals which are categorized as soft and hardgoals Soft goals represent the nonfunctional requirements ofthe system (such as security fault tolerance and scalability)while hard goals represent the functional aspects of thesystem (such as machine learning and data size support)Therelationship between the system components and the goalscan be ranked as very positive (++) positive (+) negative(minus) and very negative (minusminus) Based on the evidence providedfrom the literature the categorization of major resourcemanagement frameworks is presented in Figure 7

5 Related Work

Our research work differs from other efforts because thesubject goal and object of study are not identical as weprovide an in-depth comparison of popular resource enginesbased on empirical evidence from existing literature Hesseand Lorenz [8] conducted a conceptual survey on streamprocessing systems However their discussion was focusedon some basic differences related to real-time data processingengines Singh and Reddy [9] provided a thorough analysisof big data analytic platforms that included peer-to-peernetworks field programmable gate arrays (FPGA) ApacheHadoop ecosystem high-performance computing (HPC)

clusters multicore CPU and graphics processing unit (GPU)Our case is different here as we are particularly interestedin big data processing engines Finally Landset et al [10]focused on machine learning libraries and their evaluationbased on ease of use scalability and extensibility Howeverour work differs from their work as the primary focuses ofboth studies are not identical

Chen and Zhang [11] discussed big data problemschallenges and associated techniques and technologies toaddress these issues Several potential techniques includingcloud computing quantum computing granular computingand biological computing were investigated and the possibleopportunities to explore these domains were demonstratedHowever the performance evaluation was discussed onlyon theoretical grounds A taxonomy and detailed analysis ofthe state of the art in big data 20 processing systems werepresented in [12] The focus of the study was to identify cur-rent research challenges and highlight opportunities for newinnovations and optimization for future research and devel-opment Assuncao et al [13] reviewedmultiple generations ofdata stream processing frameworks that providemechanismsfor resource elasticity to match the demands of streamprocessing services The study examined the challengesassociated with efficient resource management decisions andsuggested solutions derived from the existing research stud-ies However the study metrics are restricted to the elasti-cityscalability aspect of big data streaming frameworks

As shown in Table 3 our work differs from the previ-ous studies which focused on the classification of resourcemanagement frameworks on theoretical grounds In contrastto the earlier approaches we classify and categorize bigdata resource management frameworks based on empiricalgrounds derived from multiple evaluationexperimentationstudies Furthermore our evaluationranking methodologyis based on a comprehensive list of study variables whichwerenot addressed in the studies conducted earlier

6 Conclusions and Future Work

There are a number of disruptive and transformative bigdata technologies and solutions that are rapidly emanatingand evolving in order to provide data-driven insight andinnovation The primary object of the study was how toclassify popular big data resource management systems Thisstudy was also aimed at addressing the selection of candidateresource provider based on specific big data applicationrequirements We surveyed different big data resource man-agement frameworks and investigated the advantages anddisadvantages for each of them We carried out the perfor-mance evaluation of resource management engines based onseven key factors and each one of the frameworks was rankedbased on the empirical evidence from the literature

61 Observations and Findings Some key findings of thestudy are as follows

(i) In terms of processing speed Apache Flink outper-forms other resource management frameworks forsmall medium and large datasets [30 36] Howeverduring our own set of experiments on Amazon EC2

Scientific Programming 13

Table3Com

paris

onandapplicationareaso

frela

tedresearch

studies

Stud

yreference

Datam

odel

Resource

fram

eworks

Stud

yfeatures

Evaluatio

nrank

ingmetho

dology

[8]

Datas

tream

processin

gsyste

ms

StormFlin

kSparkSamza

Abriefcom

paris

onof

resource

fram

eworks

N

[9]

Batchandstr

eam

processin

gsyste

ms

Horizon

talscalin

gsyste

mssuch

aspeer-to

-peerMapRe

duceM

PIand

Sparkandverticalscalingsyste

mssuch

asCU

DAandHDL

Com

paris

onof

horiz

ontaland

vertical

scalingsyste

ms

Theoretic

alcomparis

onof

resource

fram

eworks

[10]

Batchandstr

eam

processin

gengines

MapRe

duceSparkFlin

kandStorm

aswell

asmachine

learning

librarie

sMachine

learning

librarie

sand

their

evaluatio

nmechanism

Perfo

rmance

comparis

onwith

respectto

machine

learning

toolkits

[11]

Batchandstr

eam

processin

gfram

eworks

Hadoo

pStormand

otherb

igdata

fram

eworks

In-depth

analysisof

bigdata

oppo

rtun

ities

andchallenges

N

[12]

Batchandstr

eam

processin

gfram

eworks

Hadoo

pSparkStormFlin

kandTeza

swellasS

QLGraph

and

bulk

synchron

ousp

arallelm

odel

Analysis

ofcurrento

penresearch

challenges

inthefi

eldof

bigdataandthe

prom

ising

directions

forfuturer

esearch

N

[13]

Stream

processin

gengines

Apache

StormS4FlinkSamzaSpark

Stream

ingandTw

itter

Heron

Classifi

catio

nof

elasticity

metric

sfor

resource

allocatio

nstrategies

thatmeet

thed

emands

ofstream

processin

gservices

Evaluatio

nof

elasticity

scalin

gmetric

sforstre

amprocessin

gsyste

ms

14 Scientific Programming

Hadoop

Spark

Flink

Samza

MachineLearning

FaultTolerance Scalability Data Size Performance Security Latency

Iterative

Recoverability

Scalable

Large

Medium

High

AccessControl

Authentication

Low

+

++

++

++

++

++

++

++

Storm++

++

Soft goal (non-functional)

Hard goal (functional)

And Decomposition

Figure 7 Categorization of resource engines based on key big data application requirements

cluster with varied task managers settings (1ndash4 taskmanagers per node) Flink failed to complete customsmaller size JVM dataset jobs due to inefficient mem-orymanagement of FlinkmemorymanagerWe couldnot find any reported evidence of this particular usecase in these relevant literature studies It seems thatmost of the performance evaluation studies employedstandard benchmarking test sets where dataset sizewas relatively large and hence this particular use casewas not reported in these studies Further researcheffort is required to elucidate the underlying specificof factors under this particular case

(ii) Big data applications usually involve massive amounts of data. Apache Spark supports the necessary strategies for fault tolerance, but it has been reported to crash on larger datasets. Even during our experimentation, Apache Spark version 1.6 (selected for compatibility with earlier research) crashed on several occasions when the dataset was larger than 500 GB. Although Spark has been ranked higher in terms of fault tolerance as data scale increases in several studies [38, 52], it has limitations in handling larger, typical big data workloads, and hence such studies cannot be generalized.

(iii) Spark MLlib and Flink-ML offer a variety of machine learning algorithms and utilities for building distributed and scalable big data applications. Spark MLlib outperforms Flink-ML in most machine learning use cases [42], except when repeated passes are performed on unchanged data. However, this performance evaluation may merit further investigation, as research studies such as [53] reported differently, with Flink outperforming Spark on sufficiently large cases.
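The phrase "repeated passes on unchanged data" refers to the access pattern of iterative learners, which scan the same dataset once per iteration; this is the pattern that Spark's in-memory caching (e.g., `persist()`) and Flink's native iterations are designed to accelerate. A minimal, framework-free sketch of that pattern (the toy data, learning rate, and pass count below are illustrative, not drawn from the benchmarks discussed above):

```python
# Iterative estimation of a dataset's mean via gradient descent on squared
# error: one full pass over the (unchanged) dataset per iteration. In a
# distributed setting, keeping `data` cached across passes is exactly what
# the engines above optimize; recomputing or re-reading it per pass is the
# slow path.

def fit_mean(data, num_passes=100, lr=0.1):
    theta = 0.0
    for _ in range(num_passes):
        # Full pass over the unchanged dataset to compute the gradient.
        grad = sum(theta - x for x in data) / len(data)
        theta -= lr * grad
    return theta

data = [1.0, 2.0, 3.0, 4.0]
theta = fit_mean(data)  # converges toward mean(data) = 2.5
```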

(iv) For graph processing algorithms such as PageRank, Flink uses the Gelly library, which provides native closed-loop iteration operators, making it a suitable platform for large-scale graph analytics. Spark, on the other hand, uses the GraphX library, which has a much longer preprocessing time to build the graph and other data structures, making its performance worse compared to Apache Flink. Apache Flink has been reported to obtain the best results for graph processing datasets (3x–5x in [54] and 2x–3x in [45]) compared to Spark. However, some studies, such as [55], reported Spark to be 1.7x faster than Flink for large graph processing. Such inconsistent behavior may be further investigated in future research studies.
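To make concrete the kind of closed-loop iterative computation that Gelly and GraphX optimize, here is a minimal, framework-free PageRank power iteration in plain Python (the four-node graph and parameter values are illustrative toy inputs, not taken from the benchmarks above):

```python
# PageRank by power iteration with damping. Each iteration distributes every
# node's rank over its outgoing edges, then mixes in a uniform teleport term.
# Frameworks differ mainly in how they execute this loop: Gelly's native
# iteration operators avoid rebuilding the graph between rounds, whereas
# GraphX pays a higher preprocessing cost to construct its structures.

def pagerank(edges, num_iters=20, damping=0.85):
    """edges: list of (src, dst) pairs; returns {node: rank}."""
    nodes = {n for e in edges for n in e}
    out_links = {n: [] for n in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(num_iters):
        contrib = {n: 0.0 for n in nodes}
        for src, targets in out_links.items():
            if targets:  # distribute this node's rank over its out-edges
                share = rank[src] / len(targets)
                for dst in targets:
                    contrib[dst] += share
            else:        # dangling node: spread its rank uniformly
                for n in nodes:
                    contrib[n] += rank[src] / len(nodes)
        rank = {n: (1 - damping) / len(nodes) + damping * contrib[n]
                for n in nodes}
    return rank

ranks = pagerank([("a", "b"), ("b", "c"), ("c", "a"), ("a", "c")])
```

Total rank mass is conserved each round, and node "c" (which receives links from both "a" and "b") ends up ranked above "b".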

6.2. Future Work. In the area of big data, there is still a clear gap that requires more effort from the research community to build an in-depth understanding of the performance characteristics of big data resource management frameworks. We consider this study a step towards enlarging our knowledge of the big data world and an effort towards improving the state of the art and achieving the big vision for the big data domain. In earlier studies, a clear ranking could not be established, as the study parameters were mostly limited to a few issues such as throughput, latency, and machine learning. Further investigation is required on resource engines such as Apache Samza in comparison with other frameworks. In addition, research effort needs to be carried out in several areas, such as data organization, platform-specific tools, and technological issues in the big data domain, in order to create next-generation big data infrastructures. The performance evaluation factors might also vary among systems depending on the algorithms used. As future work, we plan to benchmark these popular resource engines against the resource demands and requirements of different scientific applications. Moreover, a scalability analysis could be performed; in particular, the performance evaluation of dynamically adding worker nodes is of peculiar interest. Additionally, further research can be carried out to evaluate performance with respect to resource competition between jobs (on different resource schedulers such as YARN and Mesos) and the fluctuation of available computing resources. Finally, most of the experimentation in earlier studies was performed using standard parameter configurations; however, each resource management framework offers domain-specific tweaks and configuration optimization mechanisms for meeting application-specific requirements. The development of a benchmark suite that aims to find maximum throughput based on configuration optimization would be an interesting direction for future research.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] "Big Data in the Cloud: Converging Technologies (August 11)," https://www.intel.com/content/dam/www/public/emea/de/de/documents/product-briefs/big-data-cloud-technologies-brief.pdf.

[2] Q. He, J. Yan, R. Kowalczyk, H. Jin, and Y. Yang, "Lifetime service level agreement management with autonomous agents for services provision," Information Sciences, vol. 179, no. 15, pp. 2591–2605, 2009.

[3] O. Rana, M. Warnier, T. B. Quillinan, and F. Brazier, "Monitoring and reputation mechanisms for service level agreements," in 5th International Workshop on Grid Economics and Business Models, GECON '08, vol. 5206 of Lecture Notes in Computer Science, pp. 125–139, 2008.

[4] N. Yuhanna and M. Gualtieri, "The Forrester Wave: Big Data Hadoop Cloud Solutions, Q2 2016: Elasticity, Automation, and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud (June 20)," https://ncmedia.azureedge.net/ncmedia/2016/05/The_Forrester_Wave_Big_D.pdf.

[5] R. Arora, "An introduction to big data, high performance computing, high-throughput computing, and Hadoop," in Conquering Big Data with High Performance Computing, pp. 1–12, 2016.

[6] F. Pop, J. Kołodziej, and B. D. Martino, Resource Management for Big Data Platforms: Algorithms, Modelling, and High-Performance Computing Techniques, Springer, 2016.

[7] M. Gribaudo, M. Iacono, and F. Palmieri, "Performance modeling of big data-oriented architectures," in Resource Management for Big Data Platforms, Computer Communications and Networks, pp. 3–34, Springer International Publishing, Cham, 2016.

[8] G. Hesse and M. Lorenz, "Conceptual survey on data stream processing systems," in Proceedings of the 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), pp. 797–802, Melbourne, Australia, December 2015.

[9] D. Singh and C. K. Reddy, "A survey on platforms for big data analytics," Journal of Big Data, vol. 2, Article ID 8, 2014.

[10] S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin, "A survey of open source tools for machine learning with big data in the Hadoop ecosystem," Journal of Big Data, vol. 2, no. 1, Article ID 24, 2015.

[11] C. L. P. Chen and C. Y. Zhang, "Data-intensive applications, challenges, techniques and technologies: a survey on Big Data," Information Sciences, vol. 275, pp. 314–347, 2014.

[12] F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, and S. Sakr, "Big data 2.0 processing systems: taxonomy and open challenges," Journal of Grid Computing, vol. 14, no. 3, pp. 379–405, 2016.

[13] M. D. d. Assuncao, A. d. S. Veith, and R. Buyya, "Distributed data stream processing and edge computing: a survey on resource elasticity and future directions," Journal of Network and Computer Applications, vol. 103, pp. 1–17, 2018.

[14] "Big data working group, big data taxonomy, 2014."

[15] W. Inoubli, S. Aridhi, H. Mezni, M. Maddouri, and E. M. Nguifo, "An experimental survey on big data frameworks," Future Generation Computer Systems, 2018.

[16] H.-K. Kwon and K.-K. Seo, "A fuzzy AHP based multi-criteria decision-making model to select a cloud service," International Journal of Smart Home, vol. 8, no. 3, pp. 175–180, 2014.

[17] A. Rezgui and S. Rezgui, "A stochastic approach for virtual machine placement in volunteer cloud federations," in Proceedings of the 2nd IEEE International Conference on Cloud Engineering, IC2E 2014, pp. 277–282, Boston, MA, USA, March 2014.

[18] M. Salama, A. Zeid, A. Shawish, and X. Jiang, "A novel QoS-based framework for cloud computing service provider selection," International Journal of Cloud Applications and Computing, vol. 4, no. 2, pp. 48–72, 2014.

[19] G. Mateescu, W. Gentzsch, and C. J. Ribbens, "Hybrid computing—where HPC meets grid and cloud computing," Future Generation Computer Systems, vol. 27, no. 5, pp. 440–453, 2011.

[20] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: enabling scalable virtual organizations," International Journal of High Performance Computing Applications, vol. 15, no. 3, pp. 200–222, 2001.

[21] S. Benedict, "Performance issues and performance analysis tools for HPC cloud applications: a survey," Computing, vol. 95, no. 2, pp. 89–108, 2013.

[22] E. C. Inacio and M. A. R. Dantas, "A survey into performance and energy efficiency in HPC, cloud and big data environments," International Journal of Networking and Virtual Organisations, vol. 14, no. 4, pp. 299–318, 2014.

[23] N. Yuhanna and M. Gualtieri, "Elasticity, Automation and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud, Q2 2016."

[24] S. Ullah, M. D. Awan, and M. S. Khiyal, "A price-performance analysis of EC2, Google Compute and Rackspace cloud providers for scientific computing," Journal of Mathematics and Computer Science, vol. 16, no. 2, pp. 178–192, 2016.

[25] J. Hurwitz, A. Nugent, F. Halper, and M. Kaufman, Big Data for Dummies, John Wiley & Sons, 2013.

[26] S. K. Garg, S. Versteeg, and R. Buyya, "A framework for ranking of cloud computing services," Future Generation Computer Systems, vol. 29, no. 4, pp. 1012–1023, 2013.

[27] D. Wu, S. Sakr, and L. Zhu, "Big data programming models," in Handbook of Big Data Technologies, pp. 31–63, 2017.

[28] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, vol. 1, no. 1, 2016.

[29] A. Li, X. Yang, S. Kandula, and M. Zhang, "CloudCmp: comparing public cloud providers," in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10), pp. 1–14, ACM, Melbourne, Australia, November 2010.

[30] J. Veiga, R. R. Exposito, X. C. Pardo, G. L. Taboada, and J. Touriño, "Performance evaluation of big data frameworks for large-scale data analytics," in Proceedings of the 4th IEEE International Conference on Big Data, Big Data 2016, pp. 424–431, December 2016.

[31] I. Mavridis and H. Karatza, "Log file analysis in cloud with Apache Hadoop and Apache Spark," in Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015), Poland, 2015.

[32] M. Zaharia, M. Chowdhury, and T. Das, "Fast and interactive analytics over Hadoop data with Spark," USENIX Login, vol. 37, no. 4, pp. 45–51, 2012.

[33] S. Vellaipandiyan and P. V. Raja, "Performance evaluation of distributed framework over YARN cluster manager," in Proceedings of the 2016 IEEE International Conference on Computational Intelligence and Computing Research, ICCIC 2016, December 2016.

[34] V. Taran, O. Alienin, S. Stirenko, Y. Gordienko, and A. Rojbi, "Performance evaluation of distributed computing environments with Hadoop and Spark frameworks," in Proceedings of the 2017 IEEE International Young Scientists' Forum on Applied Physics and Engineering (YSF), pp. 80–83, Lviv, Ukraine, October 2017.

[35] S. Gopalani and R. Arora, "Comparing Apache Spark and Map Reduce with performance analysis using K-means," International Journal of Computer Applications, vol. 113, no. 1, pp. 8–11, 2015.

[36] M. Bertoni, S. Ceri, A. Kaitoua, and P. Pinoli, "Evaluating cloud frameworks on genomic applications," in Proceedings of the 3rd IEEE International Conference on Big Data, IEEE Big Data 2015, pp. 193–202, November 2015.

[37] S. Hukerikar, R. A. Ashraf, and C. Engelmann, "Towards new metrics for high-performance computing resilience," in Proceedings of the 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017, pp. 23–30, June 2017.

[38] R. Lu, G. Wu, B. Xie, and J. Hu, "Stream bench: towards benchmarking modern distributed stream computing frameworks," in Proceedings of the 7th IEEE/ACM International Conference on Utility and Cloud Computing, UCC 2014, pp. 69–78, December 2014.

[39] L. Gu and H. Li, "Memory or time: performance evaluation for iterative operation on Hadoop and Spark," in Proceedings of the 15th IEEE International Conference on High Performance Computing and Communications, HPCC 2013, and 11th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, EUC 2013, pp. 721–727, November 2013.

[40] M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte, "A performance comparison of open-source stream processing platforms," in Proceedings of the 59th IEEE Global Communications Conference, GLOBECOM 2016, December 2016.

[41] J. L. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, vol. 31, no. 5, pp. 532–533, 1988.

[42] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, "A comparison on scalability for batch big data processing on Apache Spark and Apache Flink," Big Data Analytics, vol. 2, no. 1, 2017.

[43] P. Jakovits and S. N. Srirama, "Evaluating MapReduce frameworks for iterative scientific computing applications," in Proceedings of the 2014 International Conference on High Performance Computing and Simulation, HPCS 2014, pp. 226–233, July 2014.

[44] C. Boden, A. Spina, T. Rabl, and V. Markl, "Benchmarking data flow systems for scalable machine learning," in Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2017, May 2017.

[45] N. Spangenberg, M. Roth, and B. Franczyk, Evaluating New Approaches of Big Data Analytics Frameworks, vol. 208 of Lecture Notes in Business Information Processing, Springer, Cham, 2015.

[46] J. Shi, Y. Qiu, U. F. Minhas et al., "Clash of the titans: MapReduce vs. Spark for large scale data analytics," in Proceedings of the 41st International Conference on Very Large Data Bases, pp. 2110–2121, Kohala Coast, HI, USA, 2015.

[47] M. Kang and J. Lee, "A comparative analysis of iterative MapReduce systems," in Proceedings of the Sixth International Conference, pp. 61–64, Jeju, Republic of Korea, October 2016.

[48] H. Lee, M. Kang, S.-B. Youn, J.-G. Lee, and Y. Kwon, "An experimental comparison of iterative MapReduce frameworks," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, pp. 2089–2094, October 2016.

[49] S. Chintapalli, D. Dagit, B. Evans et al., "Benchmarking streaming computation engines: Storm, Flink and Spark streaming," in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016, pp. 1789–1792, May 2016.

[50] X. Zhang, C. Liu, S. Nepal, W. Dou, and J. Chen, "Privacy-preserving layer over MapReduce on cloud," in Proceedings of the 2nd International Conference on Cloud and Green Computing, CGC 2012, held jointly with the 2nd International Conference on Social Computing and Its Applications, SCA 2012, pp. 304–310, China, November 2012.

[51] L. Chung, B. A. Nixon, and E. Yu, "Using nonfunctional requirements to systematically select among alternatives in architectural design," in Proceedings of the 1st International Workshop on Architectures for Software Systems, pp. 31–43, 1995.

[52] S. Qian, G. Wu, J. Huang, and T. Das, "Benchmarking modern distributed streaming platforms," in Proceedings of the 2016 IEEE International Conference on Industrial Technology (ICIT), pp. 592–598, Taipei, Taiwan, March 2016.

[53] F. Dambreville and A. Toumi, "Generic and massively concurrent computation of belief combination rules," in Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, pp. 1–6, Blagoevgrad, Bulgaria, November 2016.

[54] R. W. Techentin, M. W. Markland, R. J. Poole, D. R. Holmes, C. R. Haider, and B. K. Gilbert, "Page rank performance evaluation of cluster computing frameworks on Cray Urika-GX supercomputer," Computer Science and Engineering, vol. 6, pp. 33–38, 2016.

[55] O.-C. Marcu, A. Costan, G. Antoniu, and M. S. Perez-Hernandez, "Spark versus Flink: understanding performance in big data analytics frameworks," in Proceedings of the 2016 IEEE International Conference on Cluster Computing, CLUSTER 2016, pp. 433–442, Taipei, Taiwan, September 2016.




[51] L Chung BANixon andE Yu ldquoUsing nonfunctional require-ments to systematically select among alternatives in architec-tural designrdquo in Proceedings of the 1st International Workshopon Architectures for Software Systems pp 31ndash43 1995

[52] S Qian G Wu J Huang and T Das ldquoBenchmarking moderndistributed streaming platformsrdquo in Proceedings of the 2016IEEE International Conference on Industrial Technology (ICIT)pp 592ndash598 Taipei Taiwan March 2016

[53] F Dambreville and A Toumi ldquoGeneric and massively concur-rent computation of belief combination rulesrdquo in Proceedingsof the the International Conference on Big Data and AdvancedWireless Technologies pp 1ndash6 Blagoevgrad BulgariaNovember2016

[54] R W Techentin M W Markland R J Poole D R H C RHaider and B K Gilbert ldquoPage rank performance evaluationof cluster computing frameworks onCrayUrika-GX supercom-puterrdquo Computer Science and Engineering vol 6 pp 33ndash382016

[55] O-C Marcu A Costan G Antoniu and M S Perez-Hernandez ldquoSpark versus flink understanding performance inbig data analytics frameworksrdquo in Proceedings of the 2016 IEEEInternational Conference on Cluster Computing CLUSTER 2016pp 433ndash442 Taipei Taiwan September 2016


14 Scientific Programming

[Figure 7 appears here: a goal-graph categorization of resource engines based on key big data application requirements. It rates Hadoop, Spark, Flink, Storm, and Samza (+ or ++) against machine learning (iterative), fault tolerance (recoverability), scalability, data size (large, medium), performance (high), security (access control, authentication), and latency (low), and distinguishes soft goals (nonfunctional) from hard goals (functional) using AND decomposition.]

Figure 7: Categorization of resource engines based on key big data application requirements.

cluster with varied task manager settings (1–4 task managers per node), Flink failed to complete custom smaller-size JVM dataset jobs due to inefficient memory management by the Flink memory manager. We could not find any reported evidence of this particular use case in the relevant literature studies. It seems that most of the performance evaluation studies employed standard benchmarking test sets in which the dataset size was relatively large, and hence this particular use case was not reported. Further research effort is required to elucidate the specific factors underlying this case.
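For concreteness, the task manager settings varied in these runs correspond to entries of the following kind in flink-conf.yaml. This is only an illustrative sketch: the key names and defaults differ across Flink releases, and the values shown are placeholders rather than the configuration used in our experiments.

```yaml
# Illustrative flink-conf.yaml fragment (key names vary by Flink version).
# One TaskManager process per node; slots control per-node parallelism.
taskmanager.numberOfTaskSlots: 4
# Heap available to each TaskManager; small heaps interact with the
# managed-memory fraction below, one suspected factor in the failures
# observed on smaller custom datasets.
taskmanager.heap.size: 4096m
# Fraction of free memory handed to Flink's managed memory segments.
taskmanager.memory.fraction: 0.7
taskmanager.memory.preallocate: false
```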

(ii) Big data applications usually involve massive amounts of data. Apache Spark supports the necessary strategies for fault tolerance, but it has been reported to crash on larger datasets. Even during our experimentation, Apache Spark version 1.6 (selected for compatibility with earlier research) crashed on several occasions when the dataset was larger than 500 GB. Although Spark has been ranked higher in terms of fault tolerance with increasing data scale in several studies [38, 52], it has limitations in handling larger, typical big data workloads, and hence such studies cannot be generalized.

(iii) Spark MLlib and Flink-ML offer a variety of machine learning algorithms and utilities for building distributed and scalable big data applications. Spark MLlib outperforms Flink-ML in most machine learning use cases [42], except when repeated passes are performed on unchanged data. However, this performance evaluation may need further investigation, as research studies such as [53] reported differently, with Flink outperforming Spark on sufficiently large cases.
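The "repeated passes on unchanged data" distinction is essentially a caching question. The sketch below is plain Python, not Spark or Flink API code, and every name in it is illustrative; it only shows why materializing a derived dataset once (as Spark's cache()/persist() allow) beats recomputing it on every pass of an iterative algorithm.

```python
# Toy illustration of why iterative passes over unchanged data favor
# explicit caching: recomputing a derived dataset on every pass costs
# far more work than materializing it once and reusing it.

def expensive_transform(records):
    # Stand-in for a costly distributed transformation.
    return [x * x for x in records]

records = list(range(100_000))

def iterate_uncached(records, passes):
    # No cache: the transformation is recomputed on every pass.
    total = 0
    for _ in range(passes):
        total += sum(expensive_transform(records))
    return total

def iterate_cached(records, passes):
    # Cached: the transformation runs once; passes reuse the result.
    cached = expensive_transform(records)
    return sum(sum(cached) for _ in range(passes))

# Both strategies compute the same answer; only the work differs.
assert iterate_uncached(records, 5) == iterate_cached(records, 5)
```

The uncached variant performs the transformation on every pass, which mirrors the recomputation an engine without dataset materialization pays for each iteration.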

(iv) For graph processing algorithms such as PageRank, Flink uses the Gelly library, which provides native closed-loop iteration operators, making it a suitable platform for large-scale graph analytics. Spark, on the other hand, uses the GraphX library, which has a much longer preprocessing time to build graphs and other data structures, making its performance worse than that of Apache Flink. Apache Flink has been reported to obtain the best results for graph processing datasets (3x–5x in [54] and 2x–3x in [45]) compared to Spark. However, some studies, such as [55], reported Spark to be 1.7x faster than Flink for large graph processing. Such inconsistent behavior may be investigated further in future research studies.
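The closed-loop iteration pattern behind PageRank can be made concrete with a minimal power-iteration sketch. This is a plain-Python toy, not Gelly or GraphX code; the graph, damping factor, and iteration count are illustrative. The point is the feedback loop: each iteration's output becomes the next iteration's input, which is exactly the pattern that native iteration operators optimize.

```python
# Minimal power-iteration PageRank on a toy graph, illustrating the
# closed-loop iteration pattern that graph engines execute natively.

def pagerank(graph, damping=0.85, iterations=50):
    nodes = list(graph)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        # Base mass: the random-jump contribution to every node.
        new_rank = {n: (1.0 - damping) / len(nodes) for n in nodes}
        # Each node distributes its current rank along its out-edges.
        for node, outgoing in graph.items():
            if outgoing:
                share = damping * rank[node] / len(outgoing)
                for target in outgoing:
                    new_rank[target] += share
        rank = new_rank  # feed the result back into the next iteration
    return rank

graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank(graph)
assert abs(sum(ranks.values()) - 1.0) < 1e-6  # rank mass is conserved
assert ranks["c"] > ranks["b"]  # "c" is linked from both "a" and "b"
```

In a distributed engine the loop body becomes a superstep over partitioned graph state; whether that state must be rebuilt per iteration or kept in place is one source of the performance gaps discussed above.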

6.2. Future Work. In the area of big data, there is still a clear gap that requires more effort from the research community to build an in-depth understanding of the performance characteristics of big data resource management frameworks. We consider this study a step towards enlarging our knowledge of the big data world and an effort towards improving the state of the art and achieving the big vision for the big data domain. In earlier studies, a clear ranking could not be established, as the study parameters were mostly limited to a few issues such as throughput, latency, and machine learning. Further investigation is required on resource engines such as Apache Samza in comparison with other frameworks. In addition, research effort needs to be carried out in several areas, such as data organization, platform-specific tools, and technological issues in the big data domain, in order to create next-generation big data infrastructures. The performance evaluation factors might also vary among systems depending on the algorithms used. As future work, we plan to benchmark these popular resource engines against the resource demands and requirements of different scientific applications. Moreover, a scalability analysis could be performed; in particular, the performance evaluation of dynamically adding worker nodes is of special interest. Additionally, further research can be carried out to evaluate performance with respect to resource competition between jobs (on different resource schedulers such as YARN and Mesos) and the fluctuation of available computing resources. Finally, most of the experiments in earlier studies were performed using standard parameter configurations; however, each resource management framework offers domain-specific tweaks and configuration optimization mechanisms for meeting application-specific requirements. The development of a benchmark suite that aims to find maximum throughput based on configuration optimization would be an interesting direction for future research.
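The configuration-optimization benchmark suggested above reduces to a parameter sweep: run one workload under every combination of tunable settings and keep the configuration with the highest throughput. The harness below is a plain-Python sketch with a placeholder workload; the parameter names (batch_size, parallelism_hint) are illustrative, and a real suite would submit jobs to the framework under test rather than time local work.

```python
import time
from itertools import product

def workload(batch_size, parallelism_hint):
    # Placeholder job standing in for a submitted Spark/Flink/Storm run;
    # parallelism_hint is accepted but unused in this local stand-in.
    items = 200_000
    start = time.perf_counter()
    chunks = [list(range(batch_size)) for _ in range(items // batch_size)]
    processed = sum(len(c) for c in chunks)
    elapsed = time.perf_counter() - start
    return processed / elapsed  # throughput in records per second

def sweep(param_grid):
    # Exhaustively evaluate the cross-product of all parameter values
    # and return the best-performing configuration plus all results.
    results = {}
    for combo in product(*param_grid.values()):
        config = dict(zip(param_grid, combo))
        results[tuple(sorted(config.items()))] = workload(**config)
    return max(results, key=results.get), results

best, results = sweep({"batch_size": [100, 1000], "parallelism_hint": [1, 2]})
assert len(results) == 4  # full cross-product of the grid was explored
```

A production harness would add repeated trials per configuration and confidence intervals, since single-run throughput figures are noisy on shared clusters.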

Conflicts of Interest

The authors declare that they have no conflicts of interest.

References

[1] "Big Data in the Cloud: Converging Technologies (August 11)," https://www.intel.com/content/dam/www/public/emea/de/de/documents/product-briefs/big-data-cloud-technologies-brief.pdf.

[2] Q. He, J. Yan, R. Kowalczyk, H. Jin, and Y. Yang, "Lifetime service level agreement management with autonomous agents for services provision," Information Sciences, vol. 179, no. 15, pp. 2591–2605, 2009.

[3] O. Rana, M. Warnier, T. B. Quillinan, and F. Brazier, "Monitoring and reputation mechanisms for service level agreements," in 5th International Workshop on Grid Economics and Business Models, GECON '08, vol. 5206 of Lecture Notes in Computer Science, pp. 125–139, 2008.

[4] N. Yuhanna and M. Gualtieri, "The Forrester Wave: Big Data Hadoop Cloud Solutions, Q2 2016: Elasticity, Automation, and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud (June 20)," https://ncmedia.azureedge.net/ncmedia/2016/05/The Forrester Wave Big D.pdf.

[5] R. Arora, "An introduction to big data, high performance computing, high-throughput computing, and Hadoop," in Conquering Big Data with High Performance Computing, pp. 1–12, 2016.

[6] F. Pop, J. Kołodziej, and B. D. Martino, Resource Management for Big Data Platforms: Algorithms, Modelling, and High-Performance Computing Techniques, Springer, 2016.

[7] M. Gribaudo, M. Iacono, and F. Palmieri, "Performance modeling of big data-oriented architectures," in Resource Management for Big Data Platforms, Computer Communications and Networks, pp. 3–34, Springer International Publishing, Cham, 2016.

[8] G. Hesse and M. Lorenz, "Conceptual survey on data stream processing systems," in Proceedings of the 2015 IEEE 21st International Conference on Parallel and Distributed Systems (ICPADS), pp. 797–802, Melbourne, Australia, December 2015.

[9] D. Singh and C. K. Reddy, "A survey on platforms for big data analytics," Journal of Big Data, vol. 2, Article ID 8, 2014.

[10] S. Landset, T. M. Khoshgoftaar, A. N. Richter, and T. Hasanin, "A survey of open source tools for machine learning with big data in the Hadoop ecosystem," Journal of Big Data, vol. 2, no. 1, Article ID 24, 2015.

[11] C. L. P. Chen and C. Y. Zhang, "Data-intensive applications, challenges, techniques and technologies: a survey on Big Data," Information Sciences, vol. 275, pp. 314–347, 2014.

[12] F. Bajaber, R. Elshawi, O. Batarfi, A. Altalhi, A. Barnawi, and S. Sakr, "Big data 2.0 processing systems: taxonomy and open challenges," Journal of Grid Computing, vol. 14, no. 3, pp. 379–405, 2016.

[13] M. D. d. Assuncao, A. d. S. Veith, and R. Buyya, "Distributed data stream processing and edge computing: a survey on resource elasticity and future directions," Journal of Network and Computer Applications, vol. 103, pp. 1–17, 2018.

[14] "Big data working group: big data taxonomy, 2014."

[15] W. Inoubli, S. Aridhi, H. Mezni, M. Maddouri, and E. M. Nguifo, "An experimental survey on big data frameworks," Future Generation Computer Systems, 2018.

[16] H.-K. Kwon and K.-K. Seo, "A fuzzy AHP based multi-criteria decision-making model to select a cloud service," International Journal of Smart Home, vol. 8, no. 3, pp. 175–180, 2014.

[17] A. Rezgui and S. Rezgui, "A stochastic approach for virtual machine placement in volunteer cloud federations," in Proceedings of the 2nd IEEE International Conference on Cloud Engineering, IC2E 2014, pp. 277–282, Boston, MA, USA, March 2014.

[18] M. Salama, A. Zeid, A. Shawish, and X. Jiang, "A novel QoS-based framework for cloud computing service provider selection," International Journal of Cloud Applications and Computing, vol. 4, no. 2, pp. 48–72, 2014.

[19] G. Mateescu, W. Gentzsch, and C. J. Ribbens, "Hybrid computing: where HPC meets grid and cloud computing," Future Generation Computer Systems, vol. 27, no. 5, pp. 440–453, 2011.

[20] I. Foster, C. Kesselman, and S. Tuecke, "The anatomy of the grid: enabling scalable virtual organizations," International Journal of High Performance Computing Applications, vol. 15, no. 3, pp. 200–222, 2001.

[21] S. Benedict, "Performance issues and performance analysis tools for HPC cloud applications: a survey," Computing, vol. 95, no. 2, pp. 89–108, 2013.

[22] E. C. Inacio and M. A. R. Dantas, "A survey into performance and energy efficiency in HPC, cloud and big data environments," International Journal of Networking and Virtual Organisations, vol. 14, no. 4, pp. 299–318, 2014.

[23] N. Yuhanna and M. Gualtieri, "Elasticity, Automation and Pay-As-You-Go Compel Enterprise Adoption of Hadoop in the Cloud, Q2 2016."

[24] S. Ullah, M. D. Awan, and M. S. Khiyal, "A price-performance analysis of EC2, Google Compute and Rackspace cloud providers for scientific computing," Journal of Mathematics and Computer Science, vol. 16, no. 2, pp. 178–192, 2016.

[25] J. Hurwitz, A. Nugent, F. Halper, and M. Kaufman, Big Data for Dummies, John Wiley & Sons, 2013.

[26] S. K. Garg, S. Versteeg, and R. Buyya, "A framework for ranking of cloud computing services," Future Generation Computer Systems, vol. 29, no. 4, pp. 1012–1023, 2013.

[27] D. Wu, S. Sakr, and L. Zhu, "Big data programming models," in Handbook of Big Data Technologies, pp. 31–63, 2017.

[28] S. García, S. Ramírez-Gallego, J. Luengo, J. M. Benítez, and F. Herrera, "Big data preprocessing: methods and prospects," Big Data Analytics, vol. 1, no. 1, 2016.

[29] A. Li, X. Yang, S. Kandula, and M. Zhang, "CloudCmp: comparing public cloud providers," in Proceedings of the 10th ACM SIGCOMM Conference on Internet Measurement (IMC '10), pp. 1–14, ACM, Melbourne, Australia, November 2010.

[30] J. Veiga, R. R. Exposito, X. C. Pardo, G. L. Taboada, and J. Touriño, "Performance evaluation of big data frameworks for large-scale data analytics," in Proceedings of the 4th IEEE International Conference on Big Data, Big Data 2016, pp. 424–431, December 2016.

[31] I. Mavridis and H. Karatza, "Log file analysis in cloud with Apache Hadoop and Apache Spark," in Proceedings of the Second International Workshop on Sustainable Ultrascale Computing Systems (NESUS 2015), Poland, 2015.

[32] M. Zaharia, M. Chowdhury, and T. Das, "Fast and interactive analytics over Hadoop data with Spark," USENIX Login, vol. 37, no. 4, pp. 45–51, 2012.

[33] S. Vellaipandiyan and P. V. Raja, "Performance evaluation of distributed framework over YARN cluster manager," in Proceedings of the 2016 IEEE International Conference on Computational Intelligence and Computing Research, ICCIC 2016, December 2016.

[34] V. Taran, O. Alienin, S. Stirenko, Y. Gordienko, and A. Rojbi, "Performance evaluation of distributed computing environments with Hadoop and Spark frameworks," in Proceedings of the 2017 IEEE International Young Scientists' Forum on Applied Physics and Engineering (YSF), pp. 80–83, Lviv, Ukraine, October 2017.

[35] S. Gopalani and R. Arora, "Comparing Apache Spark and Map Reduce with performance analysis using K-means," International Journal of Computer Applications, vol. 113, no. 1, pp. 8–11, 2015.

[36] M. Bertoni, S. Ceri, A. Kaitoua, and P. Pinoli, "Evaluating cloud frameworks on genomic applications," in Proceedings of the 3rd IEEE International Conference on Big Data, IEEE Big Data 2015, pp. 193–202, November 2015.

[37] S. Hukerikar, R. A. Ashraf, and C. Engelmann, "Towards new metrics for high-performance computing resilience," in Proceedings of the 7th Fault Tolerance for HPC at eXtreme Scale Workshop, FTXS 2017, pp. 23–30, June 2017.

[38] R. Lu, G. Wu, B. Xie, and J. Hu, "Stream bench: towards benchmarking modern distributed stream computing frameworks," in Proceedings of the 7th IEEE/ACM International Conference on Utility and Cloud Computing, UCC 2014, pp. 69–78, December 2014.

[39] L. Gu and H. Li, "Memory or time: performance evaluation for iterative operation on Hadoop and Spark," in Proceedings of the 15th IEEE International Conference on High Performance Computing and Communications, HPCC 2013, and 11th IEEE/IFIP International Conference on Embedded and Ubiquitous Computing, EUC 2013, pp. 721–727, November 2013.

[40] M. A. Lopez, A. G. P. Lobato, and O. C. M. B. Duarte, "A performance comparison of open-source stream processing platforms," in Proceedings of the 59th IEEE Global Communications Conference, GLOBECOM 2016, December 2016.

[41] J. L. Gustafson, "Reevaluating Amdahl's law," Communications of the ACM, vol. 31, no. 5, pp. 532–533, 1988.

[42] D. García-Gil, S. Ramírez-Gallego, S. García, and F. Herrera, "A comparison on scalability for batch big data processing on Apache Spark and Apache Flink," Big Data Analytics, vol. 2, no. 1, 2017.

[43] P. Jakovits and S. N. Srirama, "Evaluating MapReduce frameworks for iterative scientific computing applications," in Proceedings of the 2014 International Conference on High Performance Computing and Simulation, HPCS 2014, pp. 226–233, July 2014.

[44] C. Boden, A. Spina, T. Rabl, and V. Markl, "Benchmarking data flow systems for scalable machine learning," in Proceedings of the 4th ACM SIGMOD Workshop on Algorithms and Systems for MapReduce and Beyond, BeyondMR 2017, May 2017.

[45] N. Spangenberg, M. Roth, and B. Franczyk, Evaluating New Approaches of Big Data Analytics Frameworks, vol. 208 of Lecture Notes in Business Information Processing, Springer, Cham, 2015.

[46] J. Shi, Y. Qiu, U. F. Minhas et al., "Clash of the titans: MapReduce vs. Spark for large scale data analytics," in Proceedings of the 41st International Conference on Very Large Data Bases, pp. 2110–2121, Kohala Coast, HI, USA, 2015.

[47] M. Kang and J. Lee, "A comparative analysis of iterative MapReduce systems," in Proceedings of the Sixth International Conference, pp. 61–64, Jeju, Republic of Korea, October 2016.

[48] H. Lee, M. Kang, S.-B. Youn, J.-G. Lee, and Y. Kwon, "An experimental comparison of iterative MapReduce frameworks," in Proceedings of the 25th ACM International Conference on Information and Knowledge Management, CIKM 2016, pp. 2089–2094, October 2016.

[49] S. Chintapalli, D. Dagit, B. Evans et al., "Benchmarking streaming computation engines: Storm, Flink and Spark streaming," in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016, pp. 1789–1792, May 2016.

[50] X. Zhang, C. Liu, S. Nepal, W. Dou, and J. Chen, "Privacy-preserving layer over MapReduce on cloud," in Proceedings of the 2nd International Conference on Cloud and Green Computing, CGC 2012, held jointly with the 2nd International Conference on Social Computing and Its Applications, SCA 2012, pp. 304–310, China, November 2012.

[51] L. Chung, B. A. Nixon, and E. Yu, "Using nonfunctional requirements to systematically select among alternatives in architectural design," in Proceedings of the 1st International Workshop on Architectures for Software Systems, pp. 31–43, 1995.

[52] S. Qian, G. Wu, J. Huang, and T. Das, "Benchmarking modern distributed streaming platforms," in Proceedings of the 2016 IEEE International Conference on Industrial Technology (ICIT), pp. 592–598, Taipei, Taiwan, March 2016.

[53] F. Dambreville and A. Toumi, "Generic and massively concurrent computation of belief combination rules," in Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, pp. 1–6, Blagoevgrad, Bulgaria, November 2016.

[54] R. W. Techentin, M. W. Markland, R. J. Poole, D. R. H. C. R. Haider, and B. K. Gilbert, "Page rank performance evaluation of cluster computing frameworks on Cray Urika-GX supercomputer," Computer Science and Engineering, vol. 6, pp. 33–38, 2016.

[55] O.-C. Marcu, A. Costan, G. Antoniu, and M. S. Perez-Hernandez, "Spark versus Flink: understanding performance in big data analytics frameworks," in Proceedings of the 2016 IEEE International Conference on Cluster Computing, CLUSTER 2016, pp. 433–442, Taipei, Taiwan, September 2016.



[39] L Gu and H Li ldquoMemory or time performance evaluation foriterative operation on Hadoop and Sparkrdquo in Proceedings of the15th IEEE International Conference on High Performance Com-puting and Communications HPCC 2013 and 11th IEEEIFIPInternational Conference on Embedded andUbiquitous Comput-ing EUC 2013 pp 721ndash727 November 2013

[40] M A Lopez A G P Lobato and O C M B Duarte ldquoA per-formance comparison of open-source stream processing plat-formsrdquo in Proceedings of the 59th IEEE Global CommunicationsConference GLOBECOM 2016 December 2016

[41] J L Gustafson ldquoReevaluating Amdahlrsquos lawrdquo Communicationsof the ACM vol 31 no 5 pp 532-533 1988

[42] D Garcıa-Gil S Ramırez-Gallego S Garcıa and F HerreraldquoA comparison on scalability for batch big data processing onApache Spark and Apache Flinkrdquo Big Data Analytics vol 2 no1 2017

[43] P Jakovits and S N Srirama ldquoEvaluating Mapreduce frame-works for iterative scientific computing applicationsrdquo in Pro-ceedings of the 2014 International Conference on High Perfor-mance Computing and Simulation HPCS 2014 pp 226ndash233 July2014

[44] C Boden A Spina T Rabl and V Markl ldquoBenchmarking dataflow systems for scalable machine learningrdquo in Proceedings ofthe 4th ACM SIGMODWorkshop on Algorithms and Systems forMapReduce and Beyond BeyondMR 2017 May 2017

[45] N Spangenberg M Roth and B Franczyk Evaluating NewApproaches of Big Data Analytics Frameworks vol 208 ofLecture Notes in Business Information Processing PublisherName Springer Cham 2015

[46] J Shi Y Qiu U FMinhas et al ldquoClash of the titansMapReducevs Spark for large scale data analyticsrdquo in Proceedings of the 41stInternational Conference on Very Large Data Bases pp 2110ndash2121 Kohala Coast HI USA 2015

[47] M Kang and J Lee ldquoA comparative analysis of iterativeMapRe-duce systemsrdquo in Proceedings of the the Sixth InternationalConference pp 61ndash64 Jeju Repbulic of Korea October 2016

[48] H Lee M Kang S-B Youn J-G Lee and Y Kwon ldquoAnexperimental comparison of iterativeMapReduce frameworksrdquo

Scientific Programming 17

in Proceedings of the 25th ACM International Conference onInformation and Knowledge Management CIKM 2016 pp2089ndash2094 October 2016

[49] S. Chintapalli, D. Dagit, B. Evans et al., "Benchmarking streaming computation engines: Storm, Flink and Spark streaming," in Proceedings of the 30th IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2016, pp. 1789–1792, May 2016.

[50] X. Zhang, C. Liu, S. Nepal, W. Dou, and J. Chen, "Privacy-preserving layer over MapReduce on cloud," in Proceedings of the 2nd International Conference on Cloud and Green Computing, CGC 2012, Held Jointly with the 2nd International Conference on Social Computing and Its Applications, SCA 2012, pp. 304–310, China, November 2012.

[51] L. Chung, B. A. Nixon, and E. Yu, "Using nonfunctional requirements to systematically select among alternatives in architectural design," in Proceedings of the 1st International Workshop on Architectures for Software Systems, pp. 31–43, 1995.

[52] S. Qian, G. Wu, J. Huang, and T. Das, "Benchmarking modern distributed streaming platforms," in Proceedings of the 2016 IEEE International Conference on Industrial Technology (ICIT), pp. 592–598, Taipei, Taiwan, March 2016.

[53] F. Dambreville and A. Toumi, "Generic and massively concurrent computation of belief combination rules," in Proceedings of the International Conference on Big Data and Advanced Wireless Technologies, pp. 1–6, Blagoevgrad, Bulgaria, November 2016.

[54] R. W. Techentin, M. W. Markland, R. J. Poole, D. R. H., C. R. Haider, and B. K. Gilbert, "Page rank performance evaluation of cluster computing frameworks on Cray Urika-GX supercomputer," Computer Science and Engineering, vol. 6, pp. 33–38, 2016.

[55] O.-C. Marcu, A. Costan, G. Antoniu, and M. S. Perez-Hernandez, "Spark versus Flink: understanding performance in big data analytics frameworks," in Proceedings of the 2016 IEEE International Conference on Cluster Computing, CLUSTER 2016, pp. 433–442, Taipei, Taiwan, September 2016.
