Optimal Execution of MapReduce Jobs in Cloud - Voices 2015

Running MapReduce Programs in Clouds - Anshul Aggarwal, Cisco Systems


TRANSCRIPT

Page 1: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Running MapReduce Programs in Clouds

- Anshul Aggarwal, Cisco Systems

Page 2: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Cloud Computing… MapReduce… Hadoop…

Page 3: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

What is MapReduce?

• Simple data-parallel programming model designed for scalability and fault-tolerance

• Pioneered by Google

• Processes 20 petabytes of data per day

• Popularized by open-source Hadoop project

• Used at Yahoo!, Facebook, Amazon, …

Page 4: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Why MapReduce Optimization?

Page 5: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Outline

• Cloud And MapReduce

• MapReduce architecture

• Example applications

• Getting started with Hadoop

• Tuning MapReduce

Page 6: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Cloud Computing

• The emergence of cloud computing has made a tremendous impact on the Information Technology (IT) industry.

• Cloud computing moves work away from personal computers and individual enterprise application servers to services provided by a cloud of computers.

• Resources such as CPU and storage are provided as general utilities to users, on demand, over the internet.

• Cloud computing is still in its initial stages, with many issues still to be addressed.

Page 7: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

CLOUD COMPUTING SERVICES

Page 8: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Outline

• Cloud And MapReduce

• MapReduce architecture

• Example applications

• Getting started with Hadoop

• Tuning MapReduce

Page 9: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

MapReduce Framework

Page 10: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

MapReduce History

• Historically, data processing was done using database technologies: most data had a well-defined structure and was stored in relational databases.

• Data volumes soon reached terabytes and then petabytes.

• Google developed a new programming model called MapReduce to handle large-scale data analysis, and later introduced the model in their seminal paper "MapReduce: Simplified Data Processing on Large Clusters" [5].

Page 11: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

What the paper says

Page 12: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Example: Facebook Lexicon

www.facebook.com/lexicon

Page 13: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

What is MapReduce used for?

• At Google:

• Index construction for Google Search

• Article clustering for Google News

• Statistical machine translation

• At Yahoo!:

• “Web map” powering Yahoo! Search

• Spam detection for Yahoo! Mail

• At Facebook:

• Data mining

• Ad optimization

• Spam detection

Page 14: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

MapReduce Framework

• A computing paradigm for processing data that resides on hundreds of computers

• Popularized recently by Google, Hadoop, and many others

• More of a framework than a library

• Makes some problems easier to solve, but introduces new challenges of its own:

• Inter-cluster network utilization

• Performance of a job that will be distributed

• Published by Google as a paper, without any actual source code

Page 15: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

MapReduce Terminology

Page 16: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Outline

• Cloud And MapReduce

• MapReduce Basics

• Example applications

• Getting started with Hadoop

• Tuning MapReduce

Page 17: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Word Count - the "Hello World" of the MapReduce world

• The word count job accepts an input directory, a mapper function, and a reducer function as inputs.

• We use the mapper function to process the data in parallel, and we use the reducer function to collect the results of the mappers and produce the final result.

• Each mapper sends its results to the reducers using a key-value model (a minimal sketch of both functions follows below).

• $ bin/hadoop jar hadoop-microbook.jar microbook.wordcount.WordCount amazon-meta.txt wordcount-output1
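
A minimal sketch of those two functions in the Hadoop Java API; the class names and package layout here are illustrative, not the microbook's actual source:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: emits (word, 1) for every word in its input split
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, ONE);  // key-value pair routed to a reducer
    }
  }
}

// Reducer: sums the counts collected for each word
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));  // final (word, count) pair
  }
}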

Page 18: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Workflow

Page 19: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Example: Word Count

[Diagram: input splits fan out to map tasks, whose key-value output is shuffled to reduce tasks]

• Job: count the occurrences of each word in a data set
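
For example, given the input line "to be or not to be", the map tasks emit (to, 1), (be, 1), (or, 1), (not, 1), (to, 1), (be, 1); the shuffle groups these pairs by key, and the reduce tasks sum each group, emitting (be, 2), (not, 1), (or, 1), (to, 2).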

Page 20: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Outline

• Cloud And MapReduce

• MapReduce Basics

• Example applications

• Mapreduce Architecture

• Getting started with Hadoop

• Tuning MapReduce

Page 21: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

How MapReduce Works

At the highest level, there are four independent entities:

• The client, which submits the MapReduce job.

• The jobtracker, which coordinates the job run. The jobtracker is a Java application whose main class is JobTracker.

• The tasktrackers, which run the tasks that the job has been split into.

• The distributed filesystem (normally HDFS), which is used for sharing job files between the other entities.

Page 22: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Anatomy of a MapReduce Job

Page 23: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Developing a MapReduce Application

• The Configuration API:

  Configuration conf = new Configuration();
  conf.addResource("configuration-1.xml");
  conf.addResource("configuration-2.xml");

• GenericOptionsParser, Tool, and ToolRunner

• Writing a Unit Test (see the MRUnit sketch below)

• Testing the Driver

• Launching a Job

% hadoop jar hadoop-examples.jar v3.MaxTemperatureDriver -conf conf/hadoop-cluster.xml \
    input/ncdc/all max-temp

• Retrieving the Results
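
As a sketch of the unit-testing step, MRUnit (an Apache library for testing MapReduce code) can drive a mapper with a known input and assert on its output. The weather record and expected values below are illustrative, and MaxTemperatureMapper is assumed to be the mapper used by the driver on the next slide:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class MaxTemperatureMapperTest {

  @Test
  public void parsesValidRecord() throws IOException {
    // Illustrative NCDC-style record: year 1950, temperature -1.1 C (-11 in tenths)
    Text value = new Text("0043011990999991950051518004+68750+023550FM-12"
        + "+038299999V0203201N00261220001CN9999999N9-00111+99999999999");
    new MapDriver<LongWritable, Text, Text, IntWritable>()
        .withMapper(new MaxTemperatureMapper())
        .withInput(new LongWritable(0), value)
        .withOutput(new Text("1950"), new IntWritable(-11))
        .runTest();  // fails the test if the mapper's output differs
  }
}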

Page 24: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

This is where the Magic Happens

import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MaxTemperatureDriver extends Configured implements Tool {

  @Override
  public int run(String[] args) throws Exception {
    Job job = new Job(getConf(), "Max temperature");   // job takes its Configuration from Tool
    job.setJarByClass(getClass());
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    job.setMapperClass(MaxTemperatureMapper.class);
    job.setCombinerClass(MaxTemperatureReducer.class); // reducer doubles as a combiner
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    return job.waitForCompletion(true) ? 0 : 1;
  }

  public static void main(String[] args) throws Exception {
    int exitCode = ToolRunner.run(new MaxTemperatureDriver(), args);
    System.exit(exitCode);
  }
}

Page 25: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Configuring MapReduce Parameters

<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>MASTER_NODE:9001</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>HADOOP_DATA_DIR/local</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>8</value>
  </property>
</configuration>

• $ bin/hadoop jar hadoop-microbook.jar microbook.wordcount.WordCount amazon-meta.txt wordcount-output1

Page 26: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Q & A

Page 27: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Outline

• Cloud And MapReduce

• MapReduce architecture

• Example applications

• Getting started with Hadoop

• Tuning MapReduce

Page 28: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Hadoop Clusters

Page 29: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers.

—Grace Hopper

Page 30: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Why is Hadoop able to compete?

Hadoop:

• Scalability (petabytes of data, thousands of machines)

• Flexibility in accepting all data formats (no schema)

• Commodity, inexpensive hardware

• Efficient and simple fault-tolerance mechanism

vs. Database:

• Performance (tons of indexing, tuning, and data-organization techniques)

• Features: provenance tracking, annotation management, …

Page 31: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

What is Hadoop?

• Hadoop is a software framework for distributed processing of large datasets across large clusters of computers

• Large datasets: terabytes or petabytes of data

• Large clusters: hundreds or thousands of nodes

• Hadoop is an open-source implementation of Google's MapReduce

• HDFS is a filesystem designed for storing very large files with streaming data access patterns, running on clusters of commodity hardware (see the read sketch below)
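
A minimal sketch of that streaming access pattern through the HDFS Java API; the file path is taken from the command line and is illustrative:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsCat {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();           // picks up core-site.xml, hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);               // the configured default filesystem (HDFS)
    FSDataInputStream in = fs.open(new Path(args[0]));  // open the file for sequential streaming
    try {
      IOUtils.copyBytes(in, System.out, 4096, false);   // stream the whole file to stdout
    } finally {
      IOUtils.closeStream(in);
    }
  }
}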


Page 32: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

What is Hadoop? (Cont'd)

• The Hadoop framework consists of two main layers:

• Distributed file system (HDFS)

• Execution engine (MapReduce)

• Hadoop is designed as a master-slave, shared-nothing architecture


Page 33: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Design Principles of Hadoop

• Automatic parallelization and distribution

• Computation is spread across thousands of nodes, hidden from the end user

• Fault tolerance and automatic recovery

• Nodes and tasks will fail and will recover automatically

• Clean and simple programming abstraction

• Users provide only two functions, "map" and "reduce"

• Need to process big data

• Commodity hardware

• A large number of low-end, cheap machines working in parallel to solve a computing problem


Page 34: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Hardware Specs

• Memory: RAM sized to the total number of concurrent tasks

• No RAID required

• No blade servers

• Dedicated switch

• Dedicated 1 Gb network line

Page 35: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Who Uses MapReduce/Hadoop?

• Google: inventors of the MapReduce computing paradigm

• Yahoo!: developers of Hadoop, the open-source implementation of MapReduce

• IBM, Microsoft, Oracle

• Facebook, Amazon, AOL, Netflix

• Many others, plus universities and research labs

• Many enterprises are turning to Hadoop

• Especially those with applications generating big data

• Web applications, social networks, scientific applications


Page 36: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Hadoop: How it Works

• Hadoop implements Google’s MapReduce, using HDFS

• MapReduce divides applications into many small blocks of work.

• HDFS creates multiple replicas of data blocks for reliability, placing them on compute nodes around the cluster.

• MapReduce can then process the data where it is located.

• Hadoop's target is to run on clusters on the order of 10,000 nodes.


Page 37: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Workflow

Page 38: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Hadoop: Assumptions

It is written with large clusters of computers in mind and is built around the following assumptions:

• Hardware will fail.

• Processing will be run in batches.

• Applications that run on HDFS have large data sets.

• It should provide high aggregate data bandwidth.

• Applications need a write-once-read-many access model.

• Moving Computation is Cheaper than Moving Data.

• Portability is important.

Page 39: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Complete Overview

Page 40: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Hadoop Distributed File System (HDFS)

• Centralized namenode: maintains metadata about files

• Many datanodes (1000s): store the actual data

• Files are divided into blocks (64 MB)

• Each block is replicated N times (default N = 3)

[Diagram: file F split into blocks 1-5, each replicated across datanodes]
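
As a small illustration, both properties are visible per file through the HDFS Java API; this is a sketch, with the path supplied on the command line:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockInfo {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    FileStatus status = fs.getFileStatus(new Path(args[0]));
    System.out.println("block size:  " + status.getBlockSize());    // e.g. 67108864 (64 MB)
    System.out.println("replication: " + status.getReplication());  // e.g. 3
  }
}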

Page 41: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Main Properties of HDFS

• Large: an HDFS instance may consist of thousands of server machines, each storing part of the file system's data

• Replication: each data block is replicated many times (default is 3)

• Failure: failure is the norm rather than the exception

• Fault tolerance: detection of faults and quick, automatic recovery from them is a core architectural goal of HDFS

• The namenode constantly monitors the datanodes (via heartbeats)


Page 42: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Outline

• Cloud And MapReduce

• MapReduce architecture

• Example applications

• Getting started with Hadoop

• Tuning MapReduce

Page 43: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Tuning Parameters

Page 44: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Mapping Workers to Processors

• The input data (on HDFS) is stored on the local disks of the machines in the cluster. HDFS divides each file into 64 MB blocks and stores several copies of each block (typically 3) on different machines.

• The MapReduce master takes the location information of the input files into account and attempts to schedule a map task on a machine that contains a replica of the corresponding input data. Failing that, it attempts to schedule the map task near a replica of that task's input data (e.g., on a machine in the same rack). When running large MapReduce operations on a significant fraction of the workers in a cluster, most input data is read locally and consumes no network bandwidth. This locality preference is sketched below.
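
The locality preference described above can be expressed as a simple policy function. This is an illustration of the idea only, not Hadoop's actual scheduler code; the method and parameter names are hypothetical:

import java.util.Map;
import java.util.Set;

public class LocalityChooser {

  // Pick a worker for a map task: prefer a node holding a replica of the task's
  // input block, then a node in the same rack as a replica, then any idle node.
  // Assumes rackOf maps every node name to its rack id.
  static String pickNode(Set<String> replicaNodes, Set<String> idleNodes,
                         Map<String, String> rackOf) {
    if (idleNodes.isEmpty()) {
      return null;                                     // nothing to schedule on
    }
    for (String node : idleNodes) {
      if (replicaNodes.contains(node)) {
        return node;                                   // node-local: no network I/O
      }
    }
    for (String node : idleNodes) {
      for (String replica : replicaNodes) {
        if (rackOf.get(node).equals(rackOf.get(replica))) {
          return node;                                 // rack-local: cheap network read
        }
      }
    }
    return idleNodes.iterator().next();                // off-rack: remote read
  }
}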


Page 45: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Task Granularity

• The map phase has M pieces and the reduce phase has R pieces.

• M and R should be much larger than the number of worker machines.

• Having each worker perform many different tasks improves dynamic load balancing, and also speeds up recovery when a worker fails.

• The larger M and R are, the more decisions the master must make (see the worked numbers below).

• R is often constrained by users because the output of each reduce task ends up in a separate output file.

• Typically (at Google), M = 200,000 and R = 5,000, using 2,000 worker machines.
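
To make the master-side cost concrete: the master makes O(M + R) scheduling decisions and keeps O(M x R) pieces of state in memory (roughly one byte per map-task/reduce-task pair, per the original MapReduce paper). With M = 200,000 and R = 5,000, that is 10^9 pairs, i.e., on the order of a gigabyte of bookkeeping state.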


Page 46: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Speculative Execution - One Approach

• Tasks may be slow for various reasons, including hardware degradation or software misconfiguration, but the causes may be hard to detect, since the tasks still complete successfully, albeit after a longer time than expected.

• Hadoop doesn't try to diagnose and fix slow-running tasks; instead, it tries to detect when a task is running slower than expected and launches another, equivalent task as a backup (see the sketch below).
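
A sketch of the detection side of this approach, loosely modeled on the heuristic of early Hadoop schedulers; the thresholds and names below are illustrative assumptions, not Hadoop's exact values:

import java.util.List;

public class SpeculationCheck {

  static final double PROGRESS_GAP = 0.2;     // how far behind the peer average counts as slow
  static final long MIN_RUNTIME_MS = 60_000;  // don't judge tasks that have just started

  // Progress scores are in [0, 1]. Returns true if an equivalent backup task should launch.
  static boolean shouldSpeculate(double taskProgress, long taskRuntimeMs,
                                 List<Double> peerProgress) {
    if (taskRuntimeMs < MIN_RUNTIME_MS) {
      return false;
    }
    double avg = peerProgress.stream()
        .mapToDouble(Double::doubleValue)
        .average()
        .orElse(0.0);
    return taskProgress < avg - PROGRESS_GAP;  // well behind its peers: launch a backup
  }
}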

Page 47: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

Problem Statement

The problem at hand is defining a resource-provisioning framework for MapReduce jobs running in a cloud, keeping in mind performance goals such as:

• Resource utilization, with an optimal number of map and reduce slots

• Improvements in execution time

• A highly scalable solution

Page 48: Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015

References

[1] E. Bortnikov, A. Frank, E. Hillel, and S. Rao, "Predicting execution bottlenecks in map-reduce clusters," in Proc. of the 4th USENIX Conference on Hot Topics in Cloud Computing, 2012.

[2] R. Buyya, S. K. Garg, and R. N. Calheiros, "SLA-oriented resource provisioning for cloud computing: Challenges, architecture, and solutions," in International Conference on Cloud and Service Computing, 2011.

[3] S. Chaisiri, B.-S. Lee, and D. Niyato, "Optimization of resource provisioning cost in cloud computing," IEEE Transactions on Services Computing, vol. 5, no. 2, April-June 2012.

[4] L. Cherkasova and R. H. Campbell, "Resource provisioning framework for MapReduce jobs with performance goals," in Middleware 2011, LNCS 7049, pp. 165-186, 2011.

[5] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Communications of the ACM, Jan. 2008.

[6] Y. Hu, J. Wong, G. Iszlai, and M. Litoiu, "Resource provisioning for cloud computing," in Proc. of the 2009 Conference of the Center for Advanced Studies on Collaborative Research, 2009.

[7] K. Kambatla, A. Pathak, and H. Pucha, "Towards optimizing Hadoop provisioning in the cloud," in Proc. of the First Workshop on Hot Topics in Cloud Computing, 2009.

[8] S. O. Kuyoro, F. Ibikunle, and O. Awodele, "Cloud computing security issues and challenges," International Journal of Computer Networks (IJCN), vol. 3, no. 5, 2011.
