
Page 1: Hadoop Mapreduce

HADOOP MAPREDUCE

Darwade Sandip

MNIT Jaipur

December 25, 2013

Darwade Sandip (MNIT) HADOOP MAPREDUCE December 25, 2013 1 / 21

Page 2: Hadoop Mapreduce

Outline

What is Hadoop

What is MapReduce

Components of Hadoop

Architecture

Implementation

Bibliography


Page 3: Hadoop Mapreduce

What is Hadoop?

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

Hadoop is best known for MapReduce, its distributed filesystem (HDFS), and large-scale data processing.


Page 4: Hadoop Mapreduce

MapReduce

Programming model for data processing

Hadoop can run MapReduce programs written in various languages, such as Java and Python

Parallel processing makes MapReduce well suited to very large-scale data analysis

The Mapper produces intermediate results

The Reducer aggregates those results


Page 5: Hadoop Mapreduce

Components of Hadoop

Two Main Components of Hadoop

HDFS

MapReduce

HDFS

Files stored in HDFS are divided into blocks, which are then copied to multiple DataNodes

A Hadoop cluster contains only one NameNode and many DataNodes

Data blocks are replicated for high availability and fast access
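The block splitting and replication just described can be sketched in a few lines of Python. This is a toy simulation for illustration only, not Hadoop code; every name in it (split_into_blocks, place_replicas, the DataNode labels) is invented.

```python
BLOCK_SIZE = 64 * 1024 * 1024   # classic HDFS default block size (64 MB)
REPLICATION = 3                 # default HDFS replication factor

def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Divide a file of file_size bytes into (offset, length) blocks."""
    blocks, offset = [], 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks

def place_replicas(blocks, datanodes, replication=REPLICATION):
    """Assign each block to `replication` distinct DataNodes, round-robin."""
    return {i: [datanodes[(i + r) % len(datanodes)] for r in range(replication)]
            for i in range(len(blocks))}

blocks = split_into_blocks(200 * 1024 * 1024)               # a 200 MB file
placement = place_replicas(blocks, ["dn1", "dn2", "dn3", "dn4"])
print(len(blocks), blocks[-1][1])   # 4 blocks; the last holds the 8 MB tail
```

A 200 MB file yields three full 64 MB blocks plus one 8 MB block, and each block ends up on three different DataNodes, matching the replication story above.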


Page 6: Hadoop Mapreduce

HDFS

NameNode

Runs on a separate machine

Manages the filesystem namespace and controls access by external clients

Stores filesystem metadata in memory

File information, the block list of each file, and the location of every file block on the DataNodes

DataNode

Runs on a separate machine; it is the basic unit of file storage

Periodically reports all of its existing blocks to the NameNode

Responds to read and write requests, and to block create, delete, and copy commands from the NameNode


Page 7: Hadoop Mapreduce

MapReduce

Files are split into fixed-size blocks (64 MB by default) and stored on DataNodes

Programs written against this model can process data on distributed clusters in parallel

The input data is a set of key/value pairs, and the output is also a set of key/value pairs

There are two main phases: Map and Reduce


Page 8: Hadoop Mapreduce

MapReduce (continued...)

Figure: MapReduce Process Architecture


Page 9: Hadoop Mapreduce

MapReduce (continued...)

Map

Map processes each block separately, in parallel

Generates a set of intermediate key/value pairs

The results of these logical blocks are then reassembled

Reduce

Accepts an intermediate key and its related values

Processes the intermediate key and values

Forms a relatively small set of result values
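The two phases can be simulated in plain Python. This is a conceptual sketch, not the Hadoop API; the record format and function names are invented, and the framework's shuffle step, which groups intermediate pairs by key between the two phases, is written out explicitly.

```python
from collections import defaultdict

def map_block(records):
    """Map phase: process one input block, emit intermediate (key, value) pairs."""
    return [(key, value) for key, value in records]

def shuffle(pairs):
    """Group intermediate pairs by key (Hadoop's framework does this for you)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_group(key, values):
    """Reduce phase: collapse one key's values into a small result."""
    return key, sum(values)

blocks = [[("a", 1), ("b", 2)], [("a", 3)]]               # two input splits
intermediate = [p for b in blocks for p in map_block(b)]  # maps run per block
result = dict(reduce_group(k, vs) for k, vs in shuffle(intermediate).items())
print(result)   # {'a': 4, 'b': 2}
```

Each block is mapped independently, exactly as the slide says; only the shuffle and reduce need to see pairs from more than one block.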


Page 10: Hadoop Mapreduce

How Hadoop runs a MapReduce job

The client, which submits the MapReduce job.

The jobtracker, which coordinates the job.

The tasktrackers, which run the tasks that the job has been split into.

Tasktrackers are Java applications whose main class is TaskTracker.

The distributed filesystem, which is used for sharing job files between the other entities.


Page 11: Hadoop Mapreduce

How Hadoop runs a MapReduce job

Job Submission

Job Initialization

Task Assignment

Task Execution

Job Completion


Page 12: Hadoop Mapreduce

How Hadoop runs a MapReduce job

Figure: How Hadoop runs a MapReduce job using the classic framework


Page 13: Hadoop Mapreduce

How Hadoop runs a MapReduce job

Job Submission

The submit() method creates an internal JobSubmitter instance and calls submitJobInternal() on it

Having submitted the job, waitForCompletion() polls the job's progress once per second. JobSubmitter then:

Asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker)

Checks the output specification of the job

Computes the input splits for the job

Copies the job resources to the shared filesystem

Tells the jobtracker that the job is ready for execution by calling submitJob()
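The client-side steps above can be mimicked with a toy function. This is purely illustrative: submit_job and its internals are inventions for this sketch, not the real JobClient/JobSubmitter API, and the "filesystem" is just a set.

```python
import itertools

_job_ids = itertools.count(1)
_filesystem = set()                  # stand-in for the shared filesystem

def submit_job(input_size, output_dir, split_size=64 * 1024 * 1024):
    """Mirror the client steps: new ID, output check, splits, copy, submit."""
    job_id = next(_job_ids)                      # like getNewJobId()
    if output_dir in _filesystem:                # output specification check
        raise ValueError("output directory already exists")
    n_splits = -(-input_size // split_size)      # ceiling: number of input splits
    _filesystem.add(output_dir)                  # "copy the resources"
    return job_id, n_splits                      # ...then hand off to the tracker

print(submit_job(130 * 1024 * 1024, "/out"))   # (1, 3)
```

A 130 MB input needs three 64 MB splits, and resubmitting to an existing output directory fails up front, which is the point of the output-specification check: the job is rejected before any tasks run.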


Page 14: Hadoop Mapreduce

How Hadoop runs a MapReduce job

Job Initialization

When the JobTracker receives a call to submitJob(), it places the job in an internal queue

It then retrieves the input splits computed by the client from the shared filesystem

Task Assignment

Tasktrackers periodically send heartbeats to the jobtracker

In response, the jobtracker assigns tasks to the tasktrackers


Page 15: Hadoop Mapreduce

How Hadoop runs a MapReduce job

Task Execution

The tasktracker's next step is to run the task

It localizes the job JAR by copying it from the shared filesystem to the tasktracker's local disk

Creates an instance of TaskRunner to run the task

Job Completion

When the jobtracker receives a notification that the last task for a job is complete, it changes the status for the job to "successful"

It then tells the user by returning from the waitForCompletion() method

The jobtracker cleans up its working state


Page 16: Hadoop Mapreduce

Implementation

Figure: Minimum Temperature


Page 17: Hadoop Mapreduce

Implementation

Figure: Maximum Temperature
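The original temperature figures are not reproduced in this transcript. As a hedged stand-in, here is a minimal Python simulation of the same logic: records are simplified (year, temperature) pairs rather than raw weather-station lines, and every name here is illustrative, not the code from the slides.

```python
from collections import defaultdict

def map_record(record):
    """Map: parse one record and emit (year, temperature)."""
    year, temp = record
    return year, temp

def reduce_extreme(year, temps, pick=max):
    """Reduce: keep one extreme per year; pass pick=min for the minimum job."""
    return year, pick(temps)

records = [(1950, 0), (1950, 22), (1949, 111), (1949, 78)]
groups = defaultdict(list)
for year, temp in map(map_record, records):
    groups[year].append(temp)
print(dict(reduce_extreme(y, ts) for y, ts in groups.items()))
# {1950: 22, 1949: 111}
```

Swapping in pick=min reproduces the minimum-temperature variant from the previous slide; the map and shuffle stages are identical in both jobs.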


Page 18: Hadoop Mapreduce

Implementation (continued...)

Figure: Word Count


Page 19: Hadoop Mapreduce

Implementation (continued...)

Figure: Word Count
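The word-count figures are likewise missing from this transcript. As a stand-in, here is a minimal Python simulation of the same map and reduce functions; the canonical Hadoop version is written in Java, and this sketch only mirrors its logic, not its API.

```python
from collections import defaultdict

def map_line(line):
    """Map: emit (word, 1) for every word in one line of input."""
    return [(word, 1) for word in line.split()]

def reduce_word(word, counts):
    """Reduce: sum the ones emitted for a single word."""
    return word, sum(counts)

lines = ["hadoop mapreduce", "hadoop hdfs"]
groups = defaultdict(list)
for line in lines:
    for word, one in map_line(line):
        groups[word].append(one)
print(dict(reduce_word(w, c) for w, c in groups.items()))
# {'hadoop': 2, 'mapreduce': 1, 'hdfs': 1}
```

Word count is the standard first MapReduce example precisely because the two phases are so cleanly separated: the map never needs to see two lines at once, and the reduce never needs to see two different words.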


Page 20: Hadoop Mapreduce

Bibliography I

G. Yang, "The application of MapReduce in the cloud computing," Intelligence Information Processing and Trusted Computing (IPTC) 2011, vol. 9, pp. 154–156, Oct 2011.

X. Zhang, G. Wang, Z. Yang, and Y. Ding, "A two-phase execution engine of reduce tasks in Hadoop MapReduce," 2012 International Conference on Systems and Informatics (ICSAI 2012), pp. 858–864, May 2012.

T. White, Hadoop: The Definitive Guide, Third Edition. Sebastopol, CA: O'Reilly Media, Inc., 2012.

J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," Operating System Design and Implementation (OSDI 2004), vol. 6, pp. 137–150, 2004.

X. Lin, Z. Meng, C. Xu, and M. Wang, "A practical performance model for Hadoop MapReduce," 2012 IEEE International Conference on Cluster Computing Workshops, pp. 231–239, Sept 2012.


Page 21: Hadoop Mapreduce

Bibliography II

Z. Gua, M. Pierce, G. Fox, and M. Zhou, "Automatic task re-organization in MapReduce," 2011 IEEE International Conference on Cluster Computing, pp. 335–343, May 2011.

K. Wang, X. Lin, and W. Tang, "An experience guided configuration optimizer for Hadoop MapReduce," Cloud Computing Technology and Science (CloudCom), pp. 419–426, Dec 2012.
