mapreduce & hadoop it332 distributed systems. outline mapreduce hadoop cloudera hadoop ...

13
MapReduce & Hadoop IT332 Distributed Systems

Upload: walter-daniels

Post on 17-Jan-2018

235 views

Category:

Documents


0 download

DESCRIPTION

MapReduce  MapReduce is a programming model for data processing  The power of MapReduce lies in its ability to scale to 100s or 1000s of computers, each with several processor cores  How large an amount of work?  Web-Scale data on the order of 100s of GBs to TBs or PBs  It is likely that the input data set will not fit on a single computer’s hard drive  Hence, a distributed file system (e.g., Google File System- GFS) is typically required 3

TRANSCRIPT

Page 1: MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2

MapReduce & Hadoop

IT332Distributed Systems

Page 2: MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2

Outline MapReduce Hadoop Cloudera Hadoop Tutorial

2

Page 3: MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2

MapReduce MapReduce is a programming model for data processing

The power of MapReduce lies in its ability to scale to 100s or 1000s of computers, each with several processor cores

How large an amount of work?

Web-Scale data on the order of 100s of GBs to TBs or PBs

It is likely that the input data set will not fit on a single computer’s hard drive

Hence, a distributed file system (e.g., Google File System- GFS) is typically required

3

Page 4: MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2

MapReduce Characteristics MapReduce ties smaller and more reasonably priced machines

together into a single cost-effective commodity cluster

MapReduce divides the workload into multiple independent tasks and schedule them across cluster nodes

A work performed by each task is done in isolation from one another

4

Page 5: MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2

Data Distribution In a MapReduce cluster, data is distributed to all the nodes of the cluster

as it is being loaded in

An underlying distributed file systems (e.g., GFS) splits large data files into chunks which are managed by different nodes in the cluster

Even though the file chunks are distributed across several machines, they form a single namesapce

5

Input data: A large file

Node 1

Chunk of input data

Node 2

Chunk of input data

Node 3

Chunk of input data

Page 6: MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2

MapReduce: A Bird’s-Eye View In MapReduce, chunks are processed in

isolation by tasks called Mappers

The outputs from the mappers are denoted as intermediate outputs (IOs) and are brought into a second set of tasks called Reducers

The process of bringing together IOs into a set of Reducers is known as shuffling process

The Reducers produce the final outputs (FOs)

Overall, MapReduce breaks the data flow into two phases, map phase and reduce phase

C0 C1 C2 C3

M0 M1 M2 M3

IO0 IO1 IO2 IO3

R0 R1

FO0 FO1

chunks

mappers

Reducers

Map

Pha

seR

educ

e Ph

ase

Shuffling Data

Page 7: MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2

Keys and Values The programmer in MapReduce has to specify two functions, the map

function and the reduce function that implement the Mapper and the Reducer in a MapReduce program

In MapReduce data elements are always structured as key-value (i.e., (K, V)) pairs

The map and reduce functions receive and emit (K, V) pairs

(K, V) Pairs Map Function

(K’, V’) Pairs

Reduce Function

(K’’, V’’) Pairs

Input Splits Intermediate Outputs Final Outputs

Page 8: MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2

Hadoop Hadoop is an open source implementation of MapReduce and is

currently enjoying wide popularity

Hadoop presents MapReduce as an analytics engine and under the hood uses a distributed storage layer referred to as Hadoop Distributed File System (HDFS)

HDFS mimics Google File System (GFS)

8

Page 9: MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2

Cloudera Hadoop

Page 10: MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2

Cloudera Virtual Manager• Cloudera VM contains a single-node Apache Hadoop cluster

along with everything you need to get started with Hadoop.

• Requirements:– A 64-bit host OS– A virtualization software: VMware Player, KVM, or VirtualBox.

• Virtualization Software will require a laptop that supports virtualization. If you are unsure, one way this can be checked by looking at your BIOS and seeing if Virtualization is Enabled.

– A 4 GB of total RAM. • The total system memory required varies depending on the size

of your data set and on the other processes that are running.

Page 11: MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2

Installation

• Step#1: Download & Run Vmware

• Step#2: Download Cloudera VM

• Step#3: Extract to the Cloudera folder.• Step#4: Open the "cloudera-quickstart-vm-

4.4.0-1-vmware"

Page 12: MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2

Once you got the software installed, fire up the VirtualBox image of Cloudera QuickStart VM and you

should see the initial screen similar to below:

Page 13: MapReduce & Hadoop IT332 Distributed Systems. Outline  MapReduce  Hadoop  Cloudera Hadoop  Tutorial 2

WordCount Tutorial

• This example computes the occurrence frequency of each word in a text file.

• Steps:1. Set up Hadoop environment2. Upload files into HDFS3. Executing Java MapReduce functions in Hadoop

• Tutorial: http://edataanalyst.com/2013/08/the-definitive-cloudera-hadoop-wordcount-tutorial/