bigdata- survey on scheduling methods in hadoop mapreduce

Acharya Institute of Technology, Bangalore

A technical Seminar on,

A Survey of Scheduling Methods in Hadoop MapReduce Framework

Presented by: Mahantesh C. Angadi, M.Tech (CNE), First Year

Under the guidance of: Prof. Amogh P. Kulkarni, AIT, Bangalore

Dept. of ISE, AIT, Bangalore

Agenda

Motivation

Introduction

What is BigData…?

What is Hadoop…?

What is HDFS and MapReduce…?

Challenges in MapReduce

Literature Survey on Scheduling in MapReduce

Survey of Scheduling Methods on Proposed Papers

Conclusion

References


Motivation

“Necessity” is the Mother of All the Inventions…!

In the early 2000s, Google faced a serious challenge: organizing the world’s information.

Google designed a new data processing infrastructure.

i. Google File System (GFS)

ii. MapReduce

In 2004, Google published a paper describing this work to the community.

Doug Cutting decided to use the technique Google described.


Introduction

With the growing use of the Internet in every domain, a large amount of data is generated and needs to be analysed.

Web search engines and social networking sites capture and analyze every user action on their sites to improve site design, detect spam, and find advertising opportunities.

Processing data at this scale is best done using distributed computing and parallel processing mechanisms.

Hadoop MapReduce is one of the most widely used techniques for handling BigData, so here we discuss its different scheduling methods.


What is BigData…?

Today we live in the data age.

Every day, we create 2.5 quintillion bytes of data, and 90% of this data is unstructured.

90% of the data in the world today has been created in the last two years alone.

Cisco estimates that global Internet traffic will reach 4.8 zettabytes a year by the end of 2015.

Ex. Social networking sites, airlines, healthcare departments, satellites, etc.


How is BigData Generated…?


What is Apache Hadoop…?

Apache Hadoop is an open-source software framework.

A platform to manage Big Data.

It’s not only a tool; it’s a framework of tools.

Most Important Hadoop subprojects:

i. HDFS: Hadoop Distributed File System

ii. MapReduce: A Programming Model



Architecture of Hadoop

Why only Hadoop…?

It is Schema-less, but RDBMS is Schema-based.

Handles large volumes of unstructured data easily.

Hadoop is designed to run on cheap commodity hardware.

Automatically handles data replication and node failure.

Moving computation is cheaper than moving data.

Last but not least: it’s free…! (Open source)


What is Hadoop HDFS…?

Inspired by Google File System.

It’s a scalable, distributed, reliable file system written in Java for the Hadoop framework.

An HDFS cluster primarily consists of:

i. NameNode

ii. DataNode

Stores very large files in blocks across machines in a large cluster, deployed on low-cost hardware.


What is MapReduce…?

A software framework for distributed processing of large data sets on computer clusters.

First developed by Google.

Intended to facilitate and simplify the processing of vast amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.

It includes JobTracker and TaskTracker.


A typical Hadoop cluster integrates MapReduce and HDFS



Example: WordCount
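
The WordCount job can be sketched in plain Python to show the map, shuffle, and reduce phases. This is only a minimal single-process illustration, not Hadoop's Java API; the function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are assumptions made for this sketch.

```python
from collections import defaultdict

def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Sum the counts collected for one word."""
    return key, sum(values)

def word_count(lines):
    # Map every line, then shuffle by word, then reduce each group.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())

counts = word_count(["Hadoop stores data", "Hadoop processes data"])
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```

On a real cluster the map and reduce calls would run as separate tasks on different nodes, and the shuffle would move data over the network.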

Challenges of MapReduce

Job Scheduling Problems

As the number and variety of jobs to be executed across heterogeneous clusters increase, so does the complexity of scheduling them efficiently to meet the required performance objectives.

Energy Efficiency Problems

Clusters usually contain hundreds or thousands of nodes, so there is a need to look at the energy efficiency of MapReduce clusters.


Literature Survey

Hadoop MapReduce scheduling methods can be categorized based on their runtime behavior as follows.

Adaptive (Dynamic) Algorithms

These methods use the previous, current and/or future values of parameters to make scheduling decisions.

Ex. Fair, Capacity, and Throughput schedulers.

Non-adaptive (Static) Algorithms

These methods do not take into consideration the changes taking place in the environment and schedule jobs/tasks as per a predefined policy/order.

Ex. FIFO (First In First Out).
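
The difference between the two categories can be illustrated with a toy model of a single task slot. The sketch below is an assumption-laden simplification, not the logic of Hadoop's actual FIFO or Fair Scheduler: `fifo_order` follows a fixed submission order, while `fair_order` (a hypothetical helper) looks at how many slots each job has received so far before each decision.

```python
def fifo_order(jobs):
    """Static policy: run every task of a job before moving to the next job."""
    return [name for name, tasks in jobs for _ in range(tasks)]

def fair_order(jobs):
    """Adaptive policy: at each step, pick the job that has received the fewest slots."""
    remaining = {name: tasks for name, tasks in jobs}
    served = {name: 0 for name, _ in jobs}
    order = []
    while any(remaining.values()):
        candidates = [n for n in remaining if remaining[n] > 0]
        # min() returns the first minimal item, so ties go to the earlier job
        name = min(candidates, key=lambda n: served[n])
        order.append(name)
        served[name] += 1
        remaining[name] -= 1
    return order

jobs = [("A", 3), ("B", 1)]  # (job name, number of tasks), in submission order
print(fifo_order(jobs))  # ['A', 'A', 'A', 'B']
print(fair_order(jobs))  # ['A', 'B', 'A', 'A']
```

Under the static policy the short job B waits behind all of A's tasks; under the adaptive policy it gets a slot as soon as A has been served once.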


Survey of Scheduling Methods on Proposed Papers


[1]. Survey of Task Scheduling Methods for MapReduce Framework in Hadoop.

This paper surveys various scheduling methods that have been proposed earlier.

These scheduling methods include:

First In First Out scheduler,

Fair Scheduler,

Capacity Scheduler,

LATE scheduler,

Deadline constraint scheduler,

etc.


[1]. Conclusion and future scope

Performance can be improved by achieving data locality in the MapReduce framework.

Finally, the authors conclude by considering how scheduling methods can be applied to heterogeneous Hadoop clusters.


[2]. Perform Wordcount MapReduce Job in Single Node Apache Hadoop Cluster & Compress Data Using LZO Algorithm.

Applications like Yahoo, Facebook, and Twitter have huge amounts of data that must be stored and retrieved on client access.

Storing this data requires a huge database, which increases physical storage and makes the analysis needed for business growth more complex.

The Lempel-Ziv-Oberhumer (LZO) algorithm is used to compress the redundant data.

The LZO algorithm was developed with “speed as the priority”.



The LZO algorithm compresses the file 5 times faster than gzip.

LZO decompression is 2 times faster than gzip.

The LZO file is slightly larger than the gzip file after compression.

A file compressed with either LZO or gzip is much smaller than the original file.

In future, this can be implemented on heterogeneous multinode clusters.
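
The speed-versus-size trade-off described above can be explored with a small measurement script. LZO is not part of the Python standard library, so this sketch swaps in `zlib` (the DEFLATE algorithm behind gzip) at its fastest and strongest levels as a stand-in; the sample data and the `measure` helper are assumptions, and the timings it prints say nothing about LZO itself.

```python
import time
import zlib

# Highly redundant sample input, standing in for repetitive log data.
data = b"hadoop mapreduce scheduling survey " * 20000

def measure(level):
    """Compress `data` at the given zlib level; return (compressed size, seconds)."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    # Round-trip check: decompression must restore the original bytes.
    assert zlib.decompress(packed) == data
    return len(packed), elapsed

fast_size, fast_time = measure(1)    # favors speed
small_size, small_time = measure(9)  # favors compression ratio

print(f"original: {len(data)} bytes")
print(f"level 1:  {fast_size} bytes in {fast_time:.4f}s")
print(f"level 9:  {small_size} bytes in {small_time:.4f}s")
```

Both settings should shrink redundant input dramatically; which one is preferable depends on whether the job is bound by CPU time or by storage and network volume.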


[3]. S3: An Efficient Shared Scan Scheduler on MapReduce Framework.

To improve performance, multiple jobs operating on a common data file can be processed as a batch to share the cost of scanning the file.

Jobs often do not arrive at the same time.

S3 operates like this:

At any given time, the system may be processing a batch of sub-jobs,

while other sub-jobs are waiting in the job queue.

When a new job arrives, its sub-jobs can be aligned with the waiting sub-jobs in the job queue.

Once the current batch of sub-jobs completes processing, the next batch of sub-jobs is initiated.
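
The steps above can be sketched as a small simulation. This is a simplified model built on assumptions, not the S3 implementation: the file is treated as a fixed number of segments scanned in a repeating cycle, one segment per time step, and each job is split into one sub-job per segment. The name `shared_scan` and its parameters are hypothetical.

```python
def shared_scan(num_segments, arrivals):
    """Simulate a shared scan.

    arrivals maps job name -> time step at which the job arrives.
    Returns (number of segment scans performed, finish step per job).
    """
    pending = dict(arrivals)   # jobs that have not arrived yet
    remaining = {}             # active job -> sub-jobs (segments) still needed
    finish = {}
    scans = 0
    t = 0
    while pending or remaining:
        # A newly arrived job joins the batch that starts at this step.
        for job, arrival in list(pending.items()):
            if arrival <= t:
                remaining[job] = num_segments
                del pending[job]
        if remaining:
            scans += 1  # one segment scan serves every active job
            for job in list(remaining):
                remaining[job] -= 1
                if remaining[job] == 0:
                    finish[job] = t + 1
                    del remaining[job]
        t += 1
    return scans, finish

scans, finish = shared_scan(4, {"A": 0, "B": 2})
print(scans, finish)  # 6 {'A': 4, 'B': 6}
```

In this run job B arrives while A is halfway through the file and shares A's last two segment scans, so the two jobs need 6 scans in total instead of the 8 that separate full scans would take.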



S3 can exploit the sharing of data scans to improve performance.

Unlike existing batch-based schedulers, S3 allows jobs to be processed as they arrive, so an arriving job does not need to wait for a long time.

More policies, such as computational resources and job priorities, can be added to S3 to make it more flexible.


[4]. Two Sides of a Coin: Optimizing the Schedule of MapReduce Jobs to Minimize their Makespan and Improve Cluster Performance.

This paper addresses a key challenge: increasing the utilization of MapReduce clusters.

The goal is to automate the design of a job schedule that minimizes the overall completion time (makespan) of a set of MapReduce jobs.

A novel abstraction framework and a heuristic called BalancedPools are discussed.



The authors ran simulations over a realistic workload and observed completion-time improvements of 15%-38%.

This shows that the order in which jobs are executed can have a significant impact on their overall completion time and on cluster resource utilization.

Future steps may include addressing the more general problem of minimizing the makespan of batch workloads.
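
The effect of job order on completion time can be seen in a toy two-stage model. The sketch assumes that each job has one map stage followed by one reduce stage, that each stage uses the whole cluster, and that the map stage of the next job may overlap with the reduce stage of the previous one. This model and the `makespan` helper are illustrative assumptions, not the paper's abstraction or its BalancedPools heuristic.

```python
def makespan(jobs):
    """jobs: list of (map_time, reduce_time) in execution order."""
    map_end = 0
    reduce_end = 0
    for map_time, reduce_time in jobs:
        map_end += map_time
        # A reduce stage starts once its own map stage and the
        # previous job's reduce stage have both finished.
        reduce_end = max(reduce_end, map_end) + reduce_time
    return reduce_end

long_map_first = [(20, 2), (2, 20)]
short_map_first = [(2, 20), (20, 2)]
print(makespan(long_map_first))   # 42
print(makespan(short_map_first))  # 24
```

The same two jobs finish in 24 time units instead of 42 when the job with the short map stage runs first, because its long reduce stage overlaps with the other job's long map stage.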


[5]. ThroughputScheduler: Learning to Schedule on Heterogeneous Hadoop Clusters.

Presently available schedulers for Hadoop clusters assign tasks to nodes without regard to the capability of the nodes.

This paper proposes a method that reduces the overall job completion time on a cluster of heterogeneous nodes by actively scheduling tasks on nodes, optimally matching job requirements to node capabilities.

Node capabilities are learned by running probe jobs on the cluster.

A Bayesian active learning scheme is used to learn the resource requirements of jobs on the fly.



The framework learns both server capabilities and job task parameters autonomously.

ThroughputScheduler can reduce total job completion time by almost 20% compared to the Hadoop Fair Scheduler and 40% compared to the FIFO Scheduler.

ThroughputScheduler also reduces average mapping time by 33% compared to either of these schedulers.
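
The idea of matching job requirements to node capabilities can be sketched with a very simple cost model. Everything here is a hypothetical illustration: the two-resource model (CPU and I/O), the field names, and the numbers are assumptions, and the learning step that ThroughputScheduler uses to obtain these values is not shown.

```python
def estimated_time(job, node):
    """Hypothetical cost model: work per resource divided by the node's rate."""
    return job["cpu_work"] / node["cpu_rate"] + job["io_work"] / node["io_rate"]

def best_node(job, nodes):
    """Pick the node with the lowest estimated completion time for this job."""
    return min(nodes, key=lambda name: estimated_time(job, nodes[name]))

nodes = {
    "fast_cpu": {"cpu_rate": 4.0, "io_rate": 1.0},
    "fast_disk": {"cpu_rate": 1.0, "io_rate": 4.0},
}
cpu_heavy = {"cpu_work": 80.0, "io_work": 10.0}
io_heavy = {"cpu_work": 10.0, "io_work": 80.0}

print(best_node(cpu_heavy, nodes))  # fast_cpu
print(best_node(io_heavy, nodes))   # fast_disk
```

A capability-blind scheduler could place the CPU-heavy job on the disk-optimized node, where this model estimates 82.5 time units instead of 30.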


Conclusion

Processing data locally takes less time than moving it across the network, so most of the algorithms try to improve data locality in order to improve job performance. To meet user expectations, scheduling algorithms must use prediction methods based on the volume of data to be processed and the underlying hardware. As future work, we can consider developing algorithms that schedule jobs efficiently on heterogeneous clusters.
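
A locality-aware choice of the next task might look like the sketch below. It is a minimal illustration under assumptions, not Hadoop's scheduler code: `pick_task`, the task and block names, and the node names are all hypothetical.

```python
def pick_task(node, pending_tasks, block_locations):
    """Prefer a task whose input block is stored on the requesting node."""
    for task, block in pending_tasks:
        if node in block_locations[block]:
            return task, "local"
    # No local data: fall back to the oldest pending task and read remotely.
    task, _ = pending_tasks[0]
    return task, "remote"

# Which nodes hold a replica of each block (hypothetical placement).
block_locations = {"b1": {"node1", "node2"}, "b2": {"node3"}}
pending = [("t1", "b1"), ("t2", "b2")]  # (task, input block), oldest first

print(pick_task("node3", pending, block_locations))  # ('t2', 'local')
print(pick_task("node4", pending, block_locations))  # ('t1', 'remote')
```

When node3 asks for work it receives t2, whose input it already stores, even though t1 was submitted earlier; node4 holds no relevant block and has to read its input over the network.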


References

[1]. J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, Proc. Sixth Symp. Operating System Design and Implementation, San Francisco, CA, Dec. 6-8, Usenix, 2004.

[2]. Lei Shi, Xiaohui Li, Kian-Lee Tan, “S3: An Efficient Shared Scan Scheduler on MapReduce Framework”, School of Computing, National University of Singapore, comp.nus.edu.sg, 2012.

[3]. Dr. Umesh Bellur, Nidhi Tiwari, “Scheduling and Energy Efficiency Improvement Techniques for Hadoop MapReduce: State of Art and Directions for Future Research”, Department of Computer Science and Engineering, Indian Institute of Technology, Mumbai.

[4]. Abhishek Verma, Ludmila Cherkasova, Roy H. Campbell, “Two Sides of a Coin: Optimizing the Schedule of MapReduce Jobs to Minimize Their Makespan and Improve Cluster Performance”, HP Labs. Supported in part by Air Force Research grant FA8750-11-2-0084.

[5]. Nandan Mirajkar, Sandeep Bhujbal, Aaradhana Deshmukh, “Perform Wordcount MapReduce Job in Single Node Apache Hadoop Cluster and Compress Data Using Lempel-Ziv-Oberhumer (LZO) Algorithm”, Department of Advanced Software and Computing Technologies, IGNOU-I2IT, Centre of Excellence for Advanced Education and Research, Pune, India.


References continued…

[6]. Shouvik Bardhan, Daniel A. Menasce, “The Anatomy of MapReduce Jobs, Scheduling, and Performance Challenges”, Proceedings of the 2013 Conference of the Computer Measurement Group, San Diego, CA, November 5-8, 2013.

[7]. Shekhar Gupta, Christian Fritz, Bob Price, Roger Hoover, and Johan de Kleer, “ThroughputScheduler: Learning to Schedule on Heterogeneous Hadoop Clusters”, USENIX Association, 10th International Conference on Autonomic Computing (ICAC 2013).

[8]. Nilam Kadale, U. A. Mande, “Survey of Task Scheduling Method for MapReduce Framework in Hadoop”, 2nd National Conference on Innovative Paradigms in Engineering & Technology (NCIPET 2013).

[9]. Tom White, “Hadoop: The Definitive Guide”, 2nd edition, O’Reilly publications, Sebastopol, CA 95472, October 2010.

[10]. J. Jeffrey Hanson, “An Introduction to the Hadoop Distributed File System”, IBM DeveloperWorks, 2011.


Thank You All…!!!

