Cascading on Starfish

Fei Dong
Duke University
dongfei@cs.duke.edu

December 10, 2011

1 Introduction

Hadoop [6] is a software framework installed on a cluster to permit large-scale distributed data analysis. It provides the robust Hadoop Distributed File System (HDFS) as well as a Java-based API that allows parallel processing across the nodes of the cluster. Programs employ a Map/Reduce execution engine which functions as a fault-tolerant distributed computing system over large data sets.

In addition to Hadoop, which is a top-level Apache project, there are subprojects related to Hadoop workflows, such as Hive [8], a data warehouse framework used for ad hoc querying (with an SQL-style query language); Pig [9], a high-level data-flow language and execution framework whose compiler produces sequences of Map/Reduce programs for execution within Hadoop; and Cascading [2], an API for defining and executing fault-tolerant data processing workflows on a Hadoop cluster. All of these projects simplify some of the work for developers, allowing them to write more traditional procedural or SQL-style code that, under the covers, creates a sequence of Hadoop jobs. In this report, we focus on Cascading as the main data-parallel workflow choice.

1.1 Cascading Introduction

Cascading is a Java application framework that allows you to more easily write scripts to access and manipulate data inside Hadoop. The API provides a number of key features:

• Dependency-Based Topological Scheduler and MapReduce Planning - Two key components of the Cascading API are its ability to schedule the invocation of flows based on dependency, with the execution order being independent of construction order, often allowing for concurrent invocation of portions of flows and cascades; and its intelligent conversion of the steps of the various flows into map-reduce invocations against the Hadoop cluster.

• Event Notification - The various steps of the flow can perform notifications via callbacks, allowing the host application to report and respond to the progress of the data processing.

• Scriptable - The Cascading API has scriptable interfaces for Jython, Groovy, and JRuby.

Although Cascading provides the above benefits, we still need to consider the balance between performance and productivity. Marz [5] gives some rules for optimizing Cascading flows, and experienced Cascading users can gain some performance improvement by following those high-level principles. One interesting question is whether there are ways to improve workflow performance without expert knowledge; in other words, we want to optimize the workflow at the physical level. In Starfish [7], the authors demonstrate the power of self-tuning jobs on Hadoop, and Herodotou has successfully applied this optimization technology to Pig. This report discusses auto-optimization of Cascading with the help of Starfish.

2 Terminology

First, we introduce some concepts widely used in Cascading.

• Stream: data input and output.

• Tuple: a stream is composed of a series of Tuples, which are sets of ordered data.

• Tap: an abstraction on top of Hadoop files. A Source tap is read from and acted upon; actions on source taps result in pipes. A Sink tap is a location to be written to; a sink tap can later serve as a source in the same script.

• Operations: define what to do on the data, e.g. Each(), Group(), CoGroup(), Every().

• Pipe: ties Operations together. When an operation is executed upon a Tap, the result is a Pipe; in other words, a flow is a pipe with data flowing through it. Pipes can use other Pipes as input, thereby wrapping themselves into a series of operations.

• Filter: data passes through it to remove useless records, e.g. RegexFilter(), And(), Or().

• Aggregator: a function applied after a group operation, e.g.
Count(), Average(), Min(), Max().

• Step: a logic unit in a Flow. It represents a Map-only or MapReduce job.

• Flow: a combination of a Source, a Sink, and Pipes.

• Cascade: a series of Flows.

3 Cascading Structure

Figure 1: A Typical Cascading Structure.

In Figure 1, we can clearly see the Cascading structure. The top level is called a Cascade, which is composed of several Flows. Each Flow defines a source Tap, a sink Tap, and Pipes; one Flow can have multiple Pipes performing data operations such as filtering, grouping, and aggregation. Internally, a Cascade is constructed through the CascadeConnector class, which builds an internal graph that makes each Flow a vertex and each file an edge. A topological walk on this graph touches each vertex in order of its dependencies; when a vertex has all its incoming edges available, it is scheduled on the cluster.

Figure 2 gives an example whose goal is to compute per-second and per-minute counts from Apache logs. The dataflow is represented as a graph. The first step imports and parses the source data; it then generates two following steps that process seconds and minutes respectively.

The execution order for Log Analysis is:

1. Calculate the dependency between flows, yielding Flow1 → Flow2.
2. Start Flow1: (2.1) initialize the import flowStep and construct Job1; (2.2) submit the import Job1 to Hadoop.
3. Start Flow2: (3.1) initialize the minute and second statistics flowSteps and construct Job2 and Job3; (3.2) submit Job2 and Job3 to Hadoop.

The complete code is attached in the Appendix.

Figure 2: Workflow Sample: Log Analysis.

4 Cascading on Starfish

4.1 Change to the new Hadoop API

We notice that the current Cascading is based on the old Hadoop API. Since Starfish only works with the new API, the first task is to connect these heterogeneous systems. Herodotos worked on supporting the old Hadoop API in Starfish, while I worked on replacing the old API in Cascading with the new one. Although the Hadoop community recommends the new API and provides some upgrade advice [11], the translation still took considerable effort.
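The dependency-driven scheduling described in Section 3 (a flow is scheduled once all of its incoming edges are available) can be sketched in self-contained Java. This is a toy model of the topological walk using Kahn's algorithm; the class and flow names are illustrative, not Cascading's actual implementation:

```java
import java.util.*;

// Toy model of Cascading's topological walk: a flow becomes runnable
// once every flow it depends on has completed (Kahn's algorithm).
public class FlowScheduler {
    public static List<String> schedule(Map<String, List<String>> deps) {
        // deps maps each flow to the flows it depends on (incoming edges)
        Map<String, Integer> pending = new HashMap<>();
        Map<String, List<String>> dependents = new HashMap<>();
        for (Map.Entry<String, List<String>> e : deps.entrySet()) {
            pending.put(e.getKey(), e.getValue().size());
            for (String d : e.getValue())
                dependents.computeIfAbsent(d, k -> new ArrayList<>()).add(e.getKey());
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : pending.entrySet())
            if (e.getValue() == 0) ready.add(e.getKey());
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String flow = ready.poll();    // all incoming edges available
            order.add(flow);               // "submit" the flow to the cluster
            for (String dep : dependents.getOrDefault(flow, List.of()))
                if (pending.merge(dep, -1, Integer::sum) == 0) ready.add(dep);
        }
        return order;
    }

    public static void main(String[] args) {
        // Log Analysis example from Section 3: Flow2 depends on Flow1
        Map<String, List<String>> deps = new LinkedHashMap<>();
        deps.put("Flow1", List.of());
        deps.put("Flow2", List.of("Flow1"));
        System.out.println(schedule(deps));  // [Flow1, Flow2]
    }
}
```

Independent flows end up in the ready queue at the same time, which is what allows Cascading to invoke portions of a cascade concurrently.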
One reason is the system complexity (about 40K lines); we sacrificed some advanced features such as S3fs, TemplateTap, ZipSplit, Stats reports, and Strategy to make the change work. Finally, we provide a revised version of Cascading that uses only the new Hadoop API. In the meantime, Herodotos recently updated Starfish to support the old API as well; this report, however, considers only the new-API version of Cascading.

4.2 Cascading Profiler

First, we need to decide when to capture the profiles. Since the modified Cascading uses the new Hadoop API, the place to enable the Profiler is the same as for a single MapReduce job. We choose the return point of blockTillCompleteOrStopped in cascading.flow.FlowStepJob to collect job execution files when a job completes. When all jobs have finished and the execution files have been collected, we build a profile graph to represent the dataflow dependencies among the jobs. In order to build the job DAG, we decouple the hierarchy of Cascades and Flows. As we saw before, the Log Analysis workflow has two dependent Flows and ultimately submits three MapReduce jobs to Hadoop. Figure 3 shows the original workflow in Cascading and the translated job graph in Starfish. We propose the following algorithm to build the job DAG.

Algorithm 1 Build Job DAG Pseudo-Code
 1: procedure BuildJobDAG(flowGraph)
 2:   for flow in flowGraph do                      // Iterate over all flows
 3:     for flowStep in flow.flowStepGraph do       // Add the job vertices
 4:       Create the jobVertex from the flowStep
 5:     end for
 6:     for edge in flow.flowStepGraph.edgeSet do   // Add the job edges within a flow
 7:       Create the corresponding edge in the jobGraph
 8:     end for
 9:   end for
10:   for flowEdge in flowGraph.edgeSet do          // Iterate over all flow edges (source -> target)
11:     sourceFlowSteps <- flowEdge.sourceFlow.getLeafFlowSteps
12:     targetFlowSteps <- flowEdge.targetFlow.getRootFlowSteps
13:     for sourceFS in sourceFlowSteps do
14:       for targetFS in targetFlowSteps do
15:         Create the job edge from the corresponding source to target
16:       end for
17:     end for
18:   end for
19: end procedure

4.3 Cascading What-if Engine and Optimizer

The What-if Engine predicts the behavior of a workflow W. To achieve that, the DAG profiles, data model, cluster, and DAG configurations are given as parameters. Building the configuration graph shares the same idea as building the job graph. We capture the return point of initializeNewJobMap in cascading.cascade, where we process what-if requests and exit the program afterwards.

Figure 3: Log Analysis. (a) Cascading representation; (b) dataflow translation.

For the Cascading optimizer, I reuse the data flow optimizer and implement the related interface. When running the optimizer, we keep the default optimizer mode, crossjob + dynamic.

4.4 Program Interface

The usage of Cascading on Starfish is simple and user-friendly: users do not need to change their source code or import a new package. Some example invocations:

profile cascading jar loganalysis.jar
  Profiler: collect task profiles while running a workflow and generate the profile files in PROFILER_OUTPUT_DIR.

execute cascading jar loganalysis.jar
  Execute: run the program without collecting profiles.
analyze cascading details workflow_20111017205527
  Analyze: list basic or detailed statistical information regarding all jobs found in PROFILER_OUTPUT_DIR.

whatif details workflow_20111018014128 cascading jar loganalysis.jar
  What-if Engine: ask a hypothetical question about a particular workflow and return the predicted profiles.

optimize run workflow_20111018014128 cascading jar loganalysis.jar
  Optimizer: execute a MapReduce workflow using the configuration parameter settings automatically suggested by the cost-based optimizer.

5 Evaluation

5.1 Experiment Environment

In the experimental evaluation, we used Hadoop clusters running on Amazon EC2, prepared as follows:

• Cluster type: m1.large, 10 nodes. Each node has 7.5 GB memory, 2 virtual cores, and 850 GB storage, and is set to run 3 map tasks and 2 reduce tasks concurrently.

• Hadoop configuration: 0.20.203.

• Cascading version: modified v1.2.4 (using the new Hadoop API).

• Data sets: 20 GB TPC-H [10], 10 GB random text, 10 GB page graphs for PageRank, 5 GB paper-author pairs.

• Optimizer type: cross-job and dynamic.

5.2 Description of Data-parallel Workflows

We evaluate the end-to-end performance of the optimizers on seven representative workflows used in different domains.
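Algorithm 1 (Section 4.2) can be sketched as self-contained Java over toy flow/step classes. The Flow, FlowEdge, and JobGraph types below are stand-ins for Cascading's and Starfish's internals, kept minimal so the leaf-to-root edge construction is visible:

```java
import java.util.*;

// Toy rendering of Algorithm 1 (Build Job DAG). A Flow holds its
// flowSteps and intra-flow step edges; flow edges connect the leaf
// steps of a source flow to the root steps of a target flow.
public class JobDagBuilder {
    record Edge(String from, String to) {}
    record FlowEdge(Flow source, Flow target) {}

    static class Flow {
        final List<String> steps;      // each step becomes one job vertex
        final List<Edge> stepEdges;    // job edges within the flow
        Flow(List<String> steps, List<Edge> stepEdges) {
            this.steps = steps; this.stepEdges = stepEdges;
        }
        List<String> leafSteps() {     // steps with no outgoing intra-flow edge
            Set<String> froms = new HashSet<>();
            for (Edge e : stepEdges) froms.add(e.from());
            List<String> leaves = new ArrayList<>();
            for (String s : steps) if (!froms.contains(s)) leaves.add(s);
            return leaves;
        }
        List<String> rootSteps() {     // steps with no incoming intra-flow edge
            Set<String> tos = new HashSet<>();
            for (Edge e : stepEdges) tos.add(e.to());
            List<String> roots = new ArrayList<>();
            for (String s : steps) if (!tos.contains(s)) roots.add(s);
            return roots;
        }
    }

    static class JobGraph {
        final List<String> vertices = new ArrayList<>();
        final List<Edge> edges = new ArrayList<>();
    }

    static JobGraph buildJobDag(List<Flow> flowGraph, List<FlowEdge> flowEdges) {
        JobGraph g = new JobGraph();
        for (Flow flow : flowGraph) {           // iterate over all flows
            g.vertices.addAll(flow.steps);      // add the job vertices
            g.edges.addAll(flow.stepEdges);     // add job edges within a flow
        }
        for (FlowEdge fe : flowEdges) {         // iterate over all flow edges
            for (String src : fe.source().leafSteps())
                for (String dst : fe.target().rootSteps())
                    g.edges.add(new Edge(src, dst));  // leaf -> root
        }
        return g;
    }

    public static void main(String[] args) {
        // Log Analysis: Flow1 = {Job1}; Flow2 = {Job2, Job3}; Flow1 -> Flow2
        Flow flow1 = new Flow(List.of("Job1"), List.of());
        Flow flow2 = new Flow(List.of("Job2", "Job3"), List.of());
        JobGraph g = buildJobDag(List.of(flow1, flow2),
                                 List.of(new FlowEdge(flow1, flow2)));
        System.out.println(g.vertices); // [Job1, Job2, Job3]
        System.out.println(g.edges);
    }
}
```

Run on the Log Analysis example, the sketch reproduces the job DAG described in Section 4.2: three vertices, with Job1 feeding both Job2 and Job3.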