
Scaling Hadoop Clusters with Virtualized Volunteer Computing Environment

Ekasit Kijsipongse and Suriya U-ruekolan

Large-Scale Simulation Research Laboratory

National Electronics and Computer Technology Center (NECTEC)

112 Thailand Science Park, Pahon Yothin Rd., Klong 1, Klong Luang, Pathumthani 12120 Thailand

Email: ekasit.kijsipongse|[email protected]

Abstract—The MapReduce framework has commonly been used to perform large-scale data processing, such as social network analysis, data mining, and machine learning, on cluster computers. However, building a large dedicated cluster for MapReduce is not cost-effective if the system is underutilized. To speed up MapReduce computation at low cost, the computing resources donated from idle desktop/notebook computers in an organization hold real potential. The MapReduce framework is therefore implemented in a Volunteer Computing environment to allow such data processing tasks to be carried out on the unused computers. Virtualization technology is deployed to resolve the security and heterogeneity problems in Volunteer Computing so that MapReduce jobs always run in a unified and isolated runtime environment. This paper presents a Hadoop cluster that can be scaled into a virtualized Volunteer Computing environment. The system consists of a small fixed set of dedicated nodes plus a variable number of volatile volunteer nodes which give additional computing power to the cluster. To this end, we consolidate Apache Hadoop, the most popular MapReduce implementation, with the virtualized BOINC platform. We evaluate the proposed system on our testbed with a MapReduce benchmark that represents different workload patterns. The performance of the Hadoop cluster is measured when its computing capability is expanded with volunteer nodes. The results show that the system scales preferably for CPU-intensive jobs, as opposed to data-intensive jobs, whose scalability is more restricted.

Keywords—MapReduce, Volunteer Computing, Virtualization

I. INTRODUCTION

Large amounts of data are being generated from various sources such as life sciences, sensor networks, and social and mobile applications. This brings the challenge of efficient programming frameworks and platforms for storing and processing such huge data to attention. One of the popular approaches is the MapReduce framework [1], which has emerged as an important paradigm for large-scale data analysis. MapReduce essentially consists of three phases: Map, Shuffle, and Reduce. In the Map phase, the input data are divided into several splits, each of which is associated with a map task. Each map task can be processed independently on a different mapper node. The mapper nodes read and extract the data from the input splits. The output of the map tasks are intermediate data in the form of <key,value> pairs stored on the mapper nodes. In the Shuffle phase, the intermediate data are sorted on the mapper nodes and all pairs with the same key are transferred to a specific reducer node. There can be multiple such reducer nodes, each of which processes distinct keys. In the Reduce phase, the reducer nodes perform a sort/merge to combine the intermediate data from many mapper nodes. They aggregate the intermediate data having the same key by applying a user-defined reduction function. Finally, the reducer nodes write out the final output. Applications written in the MapReduce framework can be executed in parallel on multiple nodes.
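For illustration, the sketch below shows what a map and a reduce function look like in the Hadoop Java API, using the classic word-frequency example; the class names and tokenization details are ours, not taken from the paper. The mapper emits intermediate <word,1> pairs, the Shuffle phase groups them by word, and the reducer sums the counts per word.

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map phase: emit a <word, 1> pair for every word in the input split.
    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);            // intermediate <key,value> pair
            }
        }
    }

    // Reduce phase: after the shuffle, all counts for one word arrive at one reducer.
    class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable c : counts) {
                sum += c.get();
            }
            context.write(word, new IntWritable(sum)); // final <word, frequency>
        }
    }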

Apache Hadoop [2], an open source software project, is the most widely used MapReduce implementation on cluster computers. The core components of Hadoop are Hadoop MapReduce and the Hadoop Distributed File System (HDFS). Hadoop MapReduce is a massively scalable parallel data-processing platform for cluster computers. HDFS provides distributed and highly reliable storage for large data on the cluster using commonly available server hardware. Both components work in tandem to schedule tasks close to the data and reduce I/O overhead. With Hadoop's built-in fault tolerance, the cluster is resilient to many types of failure, allowing it to scale out to hundreds of nodes. However, building large dedicated Hadoop clusters is not always feasible or cost-effective for many organizations, since the purchasing, operational, and maintenance costs are high while the systems are not fully utilized most of the time.

Volunteer Computing, on the other hand, allows users to donate their unused computing resources to execute time-consuming tasks. Volunteer Computing can be deployed at Internet scale, where thousands of anonymous and untrusted users from many countries participate, as in SETI@home [3]. A Desktop Grid is another type of Volunteer Computing deployed at a smaller scale, such as a company or a university, where the participating computers are more accountable and less anonymous. By leveraging the unused computing resources of desktop/notebook computers during lunch or at night, it is possible to speed up MapReduce computation at low cost. However, one major hindrance to the realization of Volunteer Computing is the heterogeneity problem. Application developers must port their applications to all possible hardware/software combinations of the volunteer nodes, which requires non-trivial effort. Incorporating virtualization technology can alleviate the problem, as developers can implement and run their applications in a unified runtime environment.

This paper presents a Hadoop cluster that can be scaled into a virtualized Volunteer Computing environment. We apply the BOINC volunteer computing platform to harness the unused computing capabilities of idle machines for Hadoop MapReduce computation. By encapsulating Hadoop in a virtual machine, we resolve the portability and security problems, the two main concerns in the adoption of Volunteer Computing.


The rest of the paper is organized as follows. Section II gives background information and related work. Section III describes the design and architecture. Section IV explains the experimental setup, benchmark, and evaluation results. Section V draws conclusions and discusses future work.

II. BACKGROUND AND RELATED WORK

A. Volunteer Computing

Volunteer Computing is a concept in distributed systems whereby users donate the computing time and resources of their desktop/notebook computers to solve a specific problem. Computers participating in Volunteer Computing are, for instance, desktop computers in a company, computers in a university lab, or even notebook computers belonging to anonymous users. These computers usually have a large proportion of their computing resources left unused from time to time, such as during lunch or at night. Thus, it is useful to aggregate these computing resources to help speed up the execution of time-consuming tasks. Volunteer Computing usually works in the master/client model. Tasks are queued on the master server, waiting for clients to fetch and execute them. A Volunteer Computing user installs a client program on their computer. This client program monitors whether the computer is in the idle state. If so, the client program requests a task from the master server, executes it, and returns the result to the master server. To prevent the volunteer task from intruding on the regular user's work, the client program can stop or suspend the task if the user resumes working on the computer.
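As a rough illustration of this master/client model, the sketch below shows the control loop in Java; every helper method (isHostIdle, fetchTaskFromMaster, execute, reportToMaster) is a hypothetical placeholder for this illustration and is not part of any real BOINC API.

    // Conceptual sketch of the volunteer client loop described above.
    public final class VolunteerClientLoop {

        public static void main(String[] args) throws InterruptedException {
            while (true) {
                if (isHostIdle()) {
                    String task = fetchTaskFromMaster(); // download a queued task
                    String result = execute(task);       // run it; suspend if the owner returns
                    reportToMaster(result);              // upload the output
                }
                Thread.sleep(60_000);                    // re-check the host state periodically
            }
        }

        // Placeholder: in a real client, idleness is inferred from user input and CPU load.
        private static boolean isHostIdle() { return true; }
        private static String fetchTaskFromMaster() { return "task"; }
        private static String execute(String task) { return "result of " + task; }
        private static void reportToMaster(String result) { System.out.println(result); }
    }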

BOINC (Berkeley Open Infrastructure for Network Computing) [4] is a widely used platform for Volunteer Computing. BOINC consists of two components: the BOINC server and the client software. The BOINC server is responsible for dispatching jobs, supplying job input, tracking job execution, and storing job output. BOINC clients interact with the users to register with a volunteer project, and to download and execute jobs voluntarily.

B. Virtualization Technology

Virtualization is a technique for creating one or several virtual computers/machines on a physical machine. The virtual machines (VMs) are sometimes called guests, while the physical machines are called hosts. A hypervisor is the software that creates and manages VMs on hosts. Operating systems and applications running in a VM are decoupled from the computer system of the host they execute on, allowing them to run on various types of underlying resources. Virtualization technology is widely used these days to improve the utilization and manageability of resources as well as the portability of applications. Virtualization also enhances security, since VMs run in isolation from each other and from the host. Virtualization technology has already been used in several related works, as described below.

In Volunteer Computing, virtualization can reduce the application developers' effort to port their applications to all combinations of hardware and software in order to leverage all possible resources. Besides, it keeps the volunteer tasks from running natively on the hosts, which may pose security concerns to the computer owners. BOINC recently added virtualization functionality to shield application developers from the portability problem. It provides the VBoxWrapper program [5], which acts as an interface between the BOINC client and the VirtualBox [6] hypervisor. The BOINC job contains a VM image to be instantiated by the VBoxWrapper. The host and the VM communicate through a shared folder, which is used to store application programs, input, and output data. However, VBoxWrapper supports only the NAT networking mode, which does not allow any incoming traffic to the VMs. Similarly, McGilvary et al. [7] proposed virtual BOINC, or V-BOINC, which runs a BOINC client in a virtualized environment. A regular BOINC client is encapsulated inside a VM image that is preconfigured with the required working environment. The V-BOINC server distributes the VM as a BOINC job. The V-BOINC client, a modified BOINC client running on the host, receives a job and starts the VM. Thus, the regular BOINC client can function in the required environment independent of the physical resources. With V-BOINC, the application developer can enable the bridged networking mode to allow incoming traffic to the VMs. Recently, the project in [8] integrated CernVM [9] with VBoxWrapper to execute data analysis tasks from the LHC physics experiments under BOINC.

Applying virtualization technology to Hadoop MapReduce has been studied in several works. Automatic virtual Hadoop cluster provisioning for the XenServer virtualization platform is discussed in [10]. Ye et al. [11] proposed the vHadoop platform, which deploys virtual Hadoop clusters on the Xen hypervisor for machine learning jobs. The vHadoop platform is triggered by a machine learning job to stage input data into HDFS, start a virtual Hadoop cluster, and collect output data from HDFS. However, their experiments on clustering algorithms do not show the scalability of the virtual Hadoop cluster since the dataset is too small. Yang et al. [12] showed that different hypervisors affect the performance of virtual Hadoop clusters. Ibrahim et al. [13] compared the performance of virtual Hadoop against native Hadoop on physical machines using simple HDFS read/write, Word Count, and sorting benchmarks. They concluded that the performance on physical machines is always better than on VMs. VMware [14] also reported that the average performance of a virtual Hadoop cluster on VMware vSphere is slightly lower than that of the native one. Ishii et al. [15] found that the performance of data-intensive jobs suffers more from virtualization than that of CPU-intensive jobs. The number of VMs per host can also have a significant impact on the performance of a virtual Hadoop cluster: in some configurations [13], multiple VMs per host reduce performance, but in others [14], [15], they increase it.

C. Native MapReduce on Volunteer Computing

Though the MapReduce framework was originally designed to run on dedicated and locally connected computers such as cluster computers, there have been several attempts to implement the framework to utilize volunteer computing resources. MOON [16] was proposed to address the challenges of running Hadoop MapReduce in a Volunteer Computing environment. MOON consists of a set of dedicated nodes and a variable number of volunteer nodes. The dedicated nodes are used to provide persistent data storage for Hadoop MapReduce, while the volunteer nodes help increase the number of data replicas and improve task execution performance.


Fig. 1. Architecture. (The figure shows a volunteer node running VirtualBox and the V-BOINC client, which hosts the Hadoop TaskTracker VM instantiated from the VM image; the V-BOINC server supplies the image, and the persistent cluster runs the Hadoop JobTracker/NameNode and Hadoop DataNodes. The labelled interactions are: 1. request VM, 2. send VM and script, 3. create/attach disk, 4. start VM, 5. get job, 6. get data, 7. shuffle data with other Hadoop TaskTrackers, 8. save data.)

MOON provides its own task scheduling and data replication services to make Hadoop MapReduce applications work reliably and efficiently even if volunteer nodes leave the system. However, MOON is not built on the BOINC platform, and the Hadoop MapReduce workers are not wrapped inside a VM running on the volunteer nodes for maximum portability. Costa et al. [17], [18] proposed a Volunteer Computing implementation that can run MapReduce jobs. They modified BOINC to allow client-to-client data transfers in the Shuffle phase between mappers and reducers, which reduces data download time and network traffic to/from the BOINC server. One drawback of the current prototype is that it only supports its own MapReduce API. Since it is not trivial to port applications written against the Hadoop API to their MapReduce API for the time being, this is an obstacle to wide adoption. None of the aforementioned works has applied a virtual Hadoop cluster to Volunteer Computing.

III. DESIGN AND ARCHITECTURE

This section describes the design and implementation of virtual Hadoop on Volunteer Computing. We currently focus on an enterprise-scale Volunteer Computing environment like a Desktop Grid. The system consists of five components, as shown in Figure 1.

• V-BOINC Server [7]: The modified BOINC server responsible for distributing the VM image of the Hadoop TaskTracker to volunteer nodes.

• V-BOINC Client [7]: The modified BOINC client running on the volunteer nodes. It downloads VM images from the V-BOINC server and instructs the virtualization hypervisor to instantiate a VM. One of the main reasons for choosing V-BOINC is its ability to set up the bridged networking mode, which is required for the Hadoop TaskTracker VM to exchange data with other nodes.

• Persistent Hadoop cluster: A native Hadoop installation on the dedicated nodes. It consists of a JobTracker and a NameNode running on a Hadoop master node. A DataNode and a TaskTracker are installed on each of the other nodes to serve as HDFS storage and worker nodes.

• VM Image and its template: The image contains the base OS and the Hadoop TaskTracker which will execute MapReduce tasks. The VM template specifies the virtual machine configuration, such as the number of CPU cores, memory size, and network device. The bridged networking mode is required for the VM to access the physical network of the hosts so as to allow incoming and outgoing traffic directly to/from the VM. The VM image is configured with a dynamic IP address.

• Volunteer Node: A desktop/notebook machine that participates in the Volunteer Computing. The volunteer nodes provide transient computing resources to the Hadoop cluster from time to time. Every volunteer node is required to install the VirtualBox hypervisor [6] and the V-BOINC client.

The operation and interaction of the components are described as follows.

1) A volunteer user uses the V-BOINC client to attach to a volunteer project on the V-BOINC server. Afterwards, the V-BOINC client requests a Hadoop TaskTracker VM image from the V-BOINC server.

2) The Hadoop TaskTracker VM image and the VM template, as well as an executable script for starting the VM, are transferred to the V-BOINC client.

3) The executable script is executed via the V-BOINC client on the volunteer node. Computing resources such as CPU, memory, disk, and network device are allocated to the VM according to the resource specification defined in the VM template (a sketch of this kind of VirtualBox operation follows this list).

4) The Hadoop TaskTracker VM is then started and assigned an IP address by the DHCP server.

5) The Hadoop TaskTracker in the VM contacts the Hadoop JobTracker for a MapReduce task.

6) During the task execution, the task reads the input data from the Hadoop HDFS data nodes.

7) In the Shuffle phase of MapReduce, the intermediate data are exchanged among the TaskTrackers on both dedicated and volunteer nodes.


8) After the task has been completed, the output data are saved to the HDFS data nodes.
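The paper does not list the actual commands in the start-up script, but steps 3 and 4 essentially perform VirtualBox operations of the following kind. The sketch below drives VBoxManage from Java; the VM name, disk image path, and bridged adapter name are assumptions made purely for illustration and do not reproduce the real V-BOINC script.

    import java.io.IOException;

    // Illustrative only: allocate resources per the VM template, attach the disk
    // image, and start the Hadoop TaskTracker VM headlessly (steps 3-4 above).
    public final class StartTaskTrackerVm {

        private static void vbox(String... args) throws IOException, InterruptedException {
            String[] cmd = new String[args.length + 1];
            cmd[0] = "VBoxManage";
            System.arraycopy(args, 0, cmd, 1, args.length);
            ProcessBuilder pb = new ProcessBuilder(cmd).inheritIO();
            if (pb.start().waitFor() != 0) {
                throw new IOException("VBoxManage " + args[0] + " failed");
            }
        }

        public static void main(String[] args) throws Exception {
            String vm = "hadoop-tasktracker";                       // assumed VM name
            vbox("createvm", "--name", vm, "--register");
            // Resources from the VM template: 2 cores, 2 GB RAM, bridged networking.
            vbox("modifyvm", vm, "--cpus", "2", "--memory", "2048",
                 "--nic1", "bridged", "--bridgeadapter1", "eth0");   // adapter name assumed
            vbox("storagectl", vm, "--name", "SATA", "--add", "sata");
            vbox("storageattach", vm, "--storagectl", "SATA", "--port", "0",
                 "--device", "0", "--type", "hdd", "--medium", "tasktracker.vdi");
            vbox("startvm", vm, "--type", "headless");
        }
    }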

A volunteer user can specify the <cpu usage limit> threshold in the V-BOINC client configuration file, which allows the V-BOINC client to automatically suspend the VM when the percentage of CPU utilization exceeds the threshold. This makes the TaskTracker inactive. Once the CPU utilization falls below the threshold, the VM is resumed and the TaskTracker continues to function. Alternatively, the volunteer user can manually control the VM via the V-BOINC client, using the resume, suspend, and stop functions, to join and leave the MapReduce computation at will.

IV. EXPERIMENTS AND RESULTS

This section describes how we evaluate the proposed system on our testbed with a benchmark covering different workload characteristics. We measure the performance of the Hadoop cluster when its computing capability is expanded with volunteer nodes. The virtualization overhead of our system is observed, and the effect of the volatility of volunteer nodes on the Hadoop cluster is also tested.

A. Experimental Configuration

Our experimental testbed consists of 4 native nodes and 6 volunteer nodes. All of them are Intel dual-core, 2.2 GHz machines with 4 GB RAM, a 200 GB hard disk, and 1 Gb Ethernet connected to a departmental network switch. CentOS 6.4 is used as the operating system on all nodes. VirtualBox 4.3 is installed on the volunteer nodes to provide the virtualization layer for the VMs. We deploy V-BOINC [7] as the virtualized Volunteer Computing platform. Each job consists of a 500 MB compressed VM image file of a Hadoop worker, which is uncompressed into 1.3 GB on the volunteer node. It is configured to be instantiated as a VM with two cores, 2 GB RAM, and a 50 GB hard disk.

Hadoop 1.2.1 is used throughout the experiments. The native nodes are persistently dedicated to running the Hadoop cluster: one node hosts the HDFS NameNode, the JobTracker, and a TaskTracker, while each of the remaining three hosts a DataNode and a TaskTracker. In this cluster, we set the HDFS block size to 64 MB and the replication factor to 1.
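For reference, the two HDFS settings above correspond to the Hadoop 1.x properties dfs.block.size and dfs.replication; in practice they would be placed in hdfs-site.xml, but a programmatic sketch is shown here for brevity and is not taken from the paper.

    import org.apache.hadoop.conf.Configuration;

    // Minimal sketch of the HDFS settings used in the experiments:
    // 64 MB blocks and a single replica per block.
    public final class ClusterSettings {
        public static Configuration experimentalHdfsConf() {
            Configuration conf = new Configuration();
            conf.setLong("dfs.block.size", 64L * 1024 * 1024); // 64 MB HDFS block size
            conf.setInt("dfs.replication", 1);                 // replication factor 1
            return conf;
        }
    }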

B. Benchmark

The benchmark used in our experiments consists of several applications ranging from CPU-intensive to data-intensive jobs. These applications represent several common workload patterns in current Hadoop use. The benchmark consists of Pi calculation and sorting, which are sample applications from the Hadoop distribution, as well as Word Count, KMeans clustering, and Pagerank network analysis from the HiBench [19] benchmark suite. A short description of each benchmark follows.

Pi calculation estimates the value of Pi using the Monte Carlo method. Each mapper independently calculates Pi and emits its value. Then, a single reducer collects the results from all mappers. We run Pi calculation with 100 map tasks and 10^9 samples. This application belongs to the pure CPU-intensive class since it has almost no data transfer.
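A minimal, standalone sketch of the estimator each mapper computes is given below; the sample program shipped with Hadoop may differ in detail, and the class name and seed are ours. Each mapper throws random points into the unit square, counts how many fall inside the quarter circle, and Pi is estimated as 4 times the inside fraction.

    import java.util.Random;

    // Monte Carlo estimate of Pi from `samples` random points in the unit square.
    public final class MonteCarloPi {
        public static double estimate(long samples, long seed) {
            Random rng = new Random(seed);
            long inside = 0;
            for (long i = 0; i < samples; i++) {
                double x = rng.nextDouble();
                double y = rng.nextDouble();
                if (x * x + y * y <= 1.0) {
                    inside++;                      // point falls inside the quarter circle
                }
            }
            return 4.0 * inside / samples;
        }

        public static void main(String[] args) {
            System.out.println(estimate(10_000_000L, 42L));
        }
    }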

The sorting benchmark sorts the data from the input files into the output files. The mappers extract and emit each record from the files as it is. Sorting is carried out during the Shuffle phase of the MapReduce framework. The reducers simply execute an identity function. The 50 GB of input data is randomly generated using TeraGen from the Hadoop distribution. We use 6 reducers for this benchmark. The sorting benchmark is a data-intensive job in which the sizes of the input, intermediate, and output data are almost equal.

Word Count finds the frequency of each word in all input files. The mapper scans the files and, for every word, emits a <word,1> pair. The combiner partially aggregates the pairs whose key is the same word from the local mapper output. The reducer performs the final aggregation of the frequency of each word. The 100 GB of input data are randomly generated using RandomTextWriter from the Hadoop distribution. Word Count represents data-intensive jobs with a large input but small intermediate data and output.
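A sketch of a driver for such a Word Count job is shown below, reusing the mapper and reducer classes sketched in Section I; registering the reducer as the combiner provides the local partial aggregation described above. The input and output paths are placeholders, and the driver is ours rather than the HiBench one.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Driver sketch: wires the mapper, combiner, and reducer into one Hadoop job.
    public final class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = new Job(new Configuration(), "wordcount");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCountMapper.class);
            job.setCombinerClass(WordCountReducer.class); // local partial aggregation
            job.setReducerClass(WordCountReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/benchmark/wordcount/input"));
            FileOutputFormat.setOutputPath(job, new Path("/benchmark/wordcount/output"));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }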

KMeans clustering is a data mining application that organizes a set of objects into groups of similar objects. Given the initial centroids of k clusters, the mappers assign each object to a cluster and emit partial centroid data. The reducer aggregates the partial centroid data into the final centroids. Then the process repeats, with the output (centroids) of the reducer becoming the input of the mappers in the next iteration, until convergence. Synthetic data of 2 GB with 10M samples and 20 dimensions are used in this benchmark. We set the number of clusters and the number of iterations to 20 and 40, respectively. KMeans clustering represents an iterative MapReduce application which runs a sequence of jobs. In each iteration, the mapper input is large, while the intermediate and output data are small.
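In the standard formulation (not spelled out in the paper), the per-iteration centroid update that the reducer completes is

    c_j^{(t+1)} = \frac{1}{|S_j^{(t)}|} \sum_{x \in S_j^{(t)}} x

where S_j^{(t)} is the set of objects currently assigned to centroid c_j^{(t)}; each mapper can emit the partial sum and count of its local objects per cluster, so the reducer only needs to add them up and divide.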

Pagerank calculates the relative importance of vertices in a graph structure, such as a web graph, based on link analysis. The core calculation is an iterative matrix-vector multiplication. Each iteration consists of a 2-stage MapReduce job. In the first stage, mappers distribute the initial vector of page ranks and the related link weights to particular reducers. Each reducer emits products of page ranks and weights, which are stored in temporary files. In the second stage, mappers read and output the products from the temporary files. The reducers then aggregate the relevant products into the sums for the updated page ranks. These two stages repeat until the vector of page ranks converges. The input data consists of a generated graph with 5M vertices and 200M edges, totalling 3 GB. We set the number of reducers and the number of iterations to 6 and 5, respectively. This Pagerank benchmark has large input, intermediate, and output data in the first stage. The second stage has large input and intermediate data but a small output.
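For reference, the standard PageRank update (with damping factor d and N vertices, not spelled out in the paper) is

    PR^{(t+1)}(v) = \frac{1-d}{N} + d \sum_{u \to v} \frac{PR^{(t)}(u)}{\mathrm{outdeg}(u)}

which corresponds to the iterative matrix-vector multiplication mentioned above; roughly, the two MapReduce stages compute the per-edge products and their per-vertex sums, respectively.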

C. Volunteer Computing and Virtualization Overhead

For the Volunteer Computing overhead, the time for a node to join the project on the V-BOINC server and download an entire job is around 90 seconds on average. After that, the time to uncompress and instantiate the VM until it has completely started on the node is 20 seconds. This overhead occurs only once as long as the user remains with the project.

Next, we evaluate the virtualization overhead by comparing the performance between the virtual and the native Hadoop node.


Fig. 2. Virtualization overhead. (Bar chart of normalized execution time, on a scale of 0 to 2.5, for Pi, Wordcount, Kmeans, Pagerank, and Sort.)

In this experiment, only one TaskTracker is executed, and we measure the execution time of each application when the TaskTracker runs inside the virtual machine and when it runs directly on the volunteer host. The benchmark is executed three times and the normalized average values are plotted in Figure 2. Clearly, the execution time on the virtual Hadoop node is longer than that on the native one in all cases, as shown by values higher than 1.0. The performance of Pi calculation, a CPU-intensive job, is only slightly affected by the virtualization overhead. Word Count, KMeans, and Pagerank have higher overhead, and sorting has the highest. Our results agree with [15] that the performance of Hadoop in a virtualized environment decreases due to virtual I/O activities.

D. Scalability

This experiment shows how the performance of the applications scales as the number of volunteer nodes increases. We run the benchmark on the base Hadoop cluster consisting of 4 native nodes (4P) and vary the number of volunteer nodes up to 6 (4P+6V). The average execution time of each application is illustrated in Figure 3. The ratio between the base execution time on the 4 native nodes and the execution time when the Hadoop cluster is scaled out with volunteer nodes is depicted in Figure 4. Pi calculation shows good scalability as the number of volunteer nodes increases. Word Count and KMeans clustering show fair scalability. Pagerank shows poor scalability. Sorting shows good scalability with a small number of volunteers, but it diminishes quickly with more volunteers. We believe that the scalability of Pagerank and sorting is restricted by their large I/O, but this requires further investigation.

E. Performance under Volatility

This experiment demonstrates that the volatility of volunteers degrades the performance of Hadoop MapReduce, yet the Hadoop fault-tolerance mechanism allows jobs to survive the resulting failures. VMs on volunteers are randomly suspended and resumed to simulate the busy and idle states of the volunteer nodes. The busy and idle states alternate and are modelled by exponential distributions with rates λ1 and λ2, respectively. Setting λ1 and λ2 equal means that the average time spent in the idle state is approximately half of the overall testing time, in other words 50% availability.
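To make the 50% figure explicit (this derivation is ours, under the alternating-exponential model just described): with busy and idle periods of mean length 1/λ1 and 1/λ2, the long-run fraction of time a volunteer is available is

    \frac{1/\lambda_2}{1/\lambda_1 + 1/\lambda_2} = \frac{\lambda_1}{\lambda_1 + \lambda_2}

which equals 50% when λ1 = λ2.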

Fig. 3. Execution Time. (Bar chart of execution time in seconds, 0 to 6000, for Pi, Wordcount, Kmeans, Pagerank, and Sort on the cluster configurations 4P, 4P+1V, 4P+2V, 4P+3V, 4P+4V, 4P+5V, and 4P+6V.)

Fig. 4. Scalability. (Speedup relative to the 4P baseline, on a scale of 0 to 2.5, for Pi, Wordcount, Kmeans, Pagerank, and Sort across the configurations 4P through 4P+6V.)

Figure 5 compares the execution time of each application when all volunteers are 50% and 100% available during job execution. The execution time at 50% availability is longer than that at 100% availability, since tasks on the volatile nodes are lost and need to be restarted.

Fig. 5. Performance under Volatility. (Execution time in seconds, 0 to 6000, for each benchmark at 100% and 50% volunteer availability.)


V. CONCLUSION

In this paper, we presented the design and architecture of a Hadoop cluster that can be scaled into a virtualized Volunteer Computing environment. The unused computing capabilities of idle machines are utilized for Hadoop MapReduce computation. We evaluated the proposed system on our testbed with a MapReduce benchmark that represents different workload patterns. The performance of the Hadoop cluster was measured when its computing capability was expanded with volunteer nodes. The results show that the system scales preferably for CPU-intensive jobs, while scalability is restricted for data-intensive jobs. The overhead of the virtualized Volunteer Computing and the effect of the volatility of volunteer nodes were also observed. In future work, we will reduce the overhead of downloading the VM image by caching and downloading portions of the image on demand. We will also investigate the performance of virtual I/O and of virtual Hadoop in an Internet-scale Volunteer Computing environment.

REFERENCES

[1] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Proceedings of the 6th Symposium on Operating Systems Design & Implementation (OSDI), 2004.
[2] "Apache Hadoop," http://hadoop.apache.org/, 2013.
[3] "SETI@home," http://setiathome.berkeley.edu/, 2013.
[4] D. P. Anderson, "BOINC: A system for public-resource computing and storage," in Proceedings of the 5th IEEE/ACM International Workshop on Grid Computing (Grid), 2004, pp. 4-10.
[5] "BOINC VboxApps," http://boinc.berkeley.edu/trac/wiki/VboxApps, 2013.
[6] "Oracle VM VirtualBox," https://www.virtualbox.org/, 2013.
[7] G. McGilvary, A. Barker, A. Lloyd, and M. Atkinson, "V-BOINC: The virtualization of BOINC," in Cluster, Cloud and Grid Computing (CCGrid), 2013 13th IEEE/ACM International Symposium on, 2013.
[8] "CernVM/VBoxWrapper test project," http://boinc.berkeley.edu/vbox, 2013.
[9] P. Buncic, C. A. Sanchez, J. Blomer, L. Franco, A. Harutyunian, P. Mato, and Y. Yao, "CernVM: A virtual software appliance for LHC applications," Journal of Physics: Conference Series, vol. 219, no. 4, p. 042003, 2010.
[10] H. Mao, Z. Zhang, B. Zhao, L. Xiao, and L. Ruan, "Towards deploying elastic Hadoop in the cloud," in Proceedings of the 2011 International Conference on Cyber-Enabled Distributed Computing and Knowledge Discovery, 2011, pp. 476-482.
[11] K. Ye, X. Jiang, Y. He, X. Li, H. Yan, and P. Huang, "vHadoop: A scalable Hadoop virtual cluster platform for MapReduce-based parallel machine learning with performance consideration," in Proceedings of the 2012 IEEE International Conference on Cluster Computing Workshops (CLUSTERW), 2012, pp. 152-160.
[12] Y. Yang, X. Long, X. Dou, and C. Wen, "Impacts of virtualization technologies on Hadoop," in The Third International Conference on Intelligent System Design and Engineering Applications (ISDEA), 2013, pp. 846-849.
[13] S. Ibrahim, H. Jin, L. Lu, L. Qi, S. Wu, and X. Shi, "Evaluating MapReduce on virtual machines: The Hadoop case," in Proceedings of the 1st International Conference on Cloud Computing (CloudCom), 2009, pp. 519-528.
[14] VMware, "A benchmarking case study of virtualized Hadoop performance on VMware vSphere," http://www.vmware.com/files/pdf/techpaper/VMW-Hadoop-Performance-vSphere5.pdf, Tech. Rep., 2011.
[15] M. Ishii, J. Han, and H. Makino, "Design and performance evaluation for Hadoop clusters on virtualized environment," in Information Networking (ICOIN), 2013 International Conference on, 2013, pp. 244-249.
[16] H. Lin, X. Ma, J. Archuleta, W.-c. Feng, M. Gardner, and Z. Zhang, "MOON: MapReduce on opportunistic environments," in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing (HPDC '10), ACM, 2010, pp. 95-106.
[17] F. Costa, L. Silva, and M. Dahlin, "Volunteer cloud computing: MapReduce over the Internet," in Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW '11), 2011.
[18] F. Costa, L. Veiga, and P. Ferreira, "VMR: Volunteer MapReduce over the large scale Internet," in Proceedings of the 10th International Workshop on Middleware for Grids, Clouds and e-Science (MGC '12), 2012.
[19] S. Huang, J. Huang, J. Dai, T. Xie, and B. Huang, "The HiBench benchmark suite: Characterization of the MapReduce-based data analysis," in The IEEE 26th International Conference on Data Engineering Workshops (ICDEW), 2010.
