Introduction to Hadoop-MapReduce Platform

Presented by: Monzur Morshed, Habibur Rahman
TigerHATS (www.tigerhats.org)

Posted on 02-Dec-2014

TRANSCRIPT

Page 1: Introduction to Hadoop-Mapreduce Platform

Introduction to Hadoop-MapReduce Platform

Presented by:
Monzur Morshed
Habibur Rahman

TigerHATS
www.tigerhats.org

Page 2: Introduction to Hadoop-Mapreduce Platform

The international research group dedicated to theories, simulation and modeling, new approaches, applications, experiences, development, evaluations, education, and human, cultural and industrial technology.

TigerHATS - Information is power

Page 3: Introduction to Hadoop-Mapreduce Platform

Hadoop

Hadoop is an open-source implementation of the MapReduce platform and distributed file system, written in Java. This module explains the basics of how to begin using Hadoop to experiment and learn from the rest of this tutorial. It covers setting up the platform and connecting other tools to use it.

Source: http://developer.yahoo.com/hadoop/tutorial/module3.html

Page 4: Introduction to Hadoop-Mapreduce Platform

What Hadoop is

• Inspired by Google
• Distributed file system similar to Google File System
• Parallel programming model similar to Google MapReduce
• Parallel database similar to Google Bigtable
• Open source Java project

Hadoop was created by Doug Cutting, who named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project.

Page 5: Introduction to Hadoop-Mapreduce Platform

Hadoop

• Distributed file system (HDFS)

• Distributed execution framework (MapReduce)

• Query language (Pig)

• Distributed, column-oriented data store (HBase)

• Machine learning (Mahout)

Page 6: Introduction to Hadoop-Mapreduce Platform

Hadoop Distributed File System

• Cluster filing system
• Designed for huge files (many GBs)
• Designed for lots of streaming reads and infrequent writes
• Not a POSIX file system: requires client help

Page 7: Introduction to Hadoop-Mapreduce Platform

What Hadoop isn't

• Hadoop is not a "classical" grid solution
• HDFS is not a POSIX file system
• HDFS is not designed for low-latency access to a huge number of small files
• Hadoop MapReduce is not designed for interactive applications
• HBase is not a relational database and does not have transactions or SQL support
• HDFS and HBase are not focused on security, encryption or multi-tenancy

Page 8: Introduction to Hadoop-Mapreduce Platform

HDFS, MapReduce

Page 9: Introduction to Hadoop-Mapreduce Platform

Typical Hadoop Cluster

Page 10: Introduction to Hadoop-Mapreduce Platform

Commodity Hardware

Typically a 2-level architecture:
– Nodes are commodity PCs
– 30-40 nodes/rack
– Uplink from rack is 3-4 gigabit
– Rack-internal is 1 gigabit

Page 11: Introduction to Hadoop-Mapreduce Platform

HDFS Architecture

[Diagram: the Client sends (1) a filename to the NameNode, receives (2) a BlockId and the list of DataNodes holding it, then (3) reads the data directly from the DataNodes. DataNodes report cluster membership to the NameNode.]

NameNode: maps a file to a file-id and a list of DataNodes
DataNode: maps a block-id to a physical location on disk
SecondaryNameNode: periodic merge of the transaction log
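The three roles above can be sketched as a toy in-memory model. This is purely illustrative (the class and method names are hypothetical, not Hadoop's API); real HDFS keeps this metadata in the NameNode's namespace image and edit log.

```python
# Toy model of the HDFS metadata split described above (illustrative only).

class NameNode:
    """Maps a filename to an ordered list of block-ids,
    and each block-id to the DataNodes holding a replica."""
    def __init__(self):
        self.file_to_blocks = {}   # filename -> [block_id, ...]
        self.block_locations = {}  # block_id -> [datanode_name, ...]

    def add_file(self, name, blocks, replicas):
        self.file_to_blocks[name] = list(blocks)
        for b in blocks:
            self.block_locations[b] = list(replicas)

    def lookup(self, name):
        # Steps 1-2 of the read path: filename -> (block-id, locations) pairs
        return [(b, self.block_locations[b]) for b in self.file_to_blocks[name]]

class DataNode:
    """Maps a block-id to the bytes stored on local disk."""
    def __init__(self):
        self.blocks = {}  # block_id -> bytes

def read_file(namenode, datanodes, name):
    # Step 3: the client reads each block directly from a DataNode.
    data = b""
    for block_id, locations in namenode.lookup(name):
        data += datanodes[locations[0]].blocks[block_id]
    return data
```

Note how the client never streams file data through the NameNode: it only asks for metadata, then talks to DataNodes directly, which is what keeps the NameNode from becoming a bandwidth bottleneck.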

Page 12: Introduction to Hadoop-Mapreduce Platform
Page 13: Introduction to Hadoop-Mapreduce Platform

Data Flow

[Diagram: Web Servers -> Scribe Servers -> Network Storage -> Hadoop Cluster -> Oracle RAC / MySQL]

Page 14: Introduction to Hadoop-Mapreduce Platform

Image Source: http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf

Page 15: Introduction to Hadoop-Mapreduce Platform

Very Large Distributed File System
– 10K nodes, 100 million files, 10 PB

Assumes Commodity Hardware
– Files are replicated to handle hardware failure
– Detect failures and recover from them

Optimized for Batch Processing
– Data locations exposed so that computations can move to where data resides
– Provides very high aggregate bandwidth

User space; runs on heterogeneous OS

HDFS - Hadoop Distributed File System

Page 16: Introduction to Hadoop-Mapreduce Platform

Data Coherency
– Write-once-read-many access model
– Client can only append to existing files

Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes

Intelligent Client
– Client can find location of blocks
– Client accesses data directly from DataNode

Distributed File System
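As a quick illustration of the fixed block size, splitting a file into 128 MB blocks is simple arithmetic. This is a sketch of the idea, not the actual HDFS client code.

```python
# Sketch: compute block boundaries for a file, using the typical
# 128 MB HDFS block size mentioned above.

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB

def block_ranges(file_size):
    """Return (offset, length) pairs covering the file; only the
    last block may be shorter than BLOCK_SIZE."""
    ranges = []
    offset = 0
    while offset < file_size:
        length = min(BLOCK_SIZE, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges

# A 300 MB file occupies three blocks: 128 MB + 128 MB + 44 MB.
print(block_ranges(300 * 1024 * 1024))
```

Each of these ranges would then be replicated on several DataNodes, which is what lets computation be scheduled near whichever replica is closest.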

Page 17: Introduction to Hadoop-Mapreduce Platform

Simple data-parallel programming model designed for scalability and fault tolerance

Framework for distributed processing of large data sets

Originally designed by Google

Pluggable user code runs in a generic framework

Pioneered by Google - processes 20 petabytes of data per day

MapReduce Paradigm

Page 18: Introduction to Hadoop-Mapreduce Platform

At Google:
– Index construction for Google Search
– Article clustering for Google News
– Statistical machine translation

At Yahoo!:
– "Web map" powering Yahoo! Search
– Spam detection for Yahoo! Mail

At Facebook:
– Data mining
– Ad optimization
– Spam detection

What is MapReduce used for?

Page 19: Introduction to Hadoop-Mapreduce Platform

In research:
– Astronomical image analysis (Washington)
– Bioinformatics (Maryland)
– Analyzing Wikipedia conflicts (PARC)
– Natural language processing (CMU)
– Particle physics (Nebraska)
– Ocean climate simulation (Washington)

What is MapReduce used for?

Page 20: Introduction to Hadoop-Mapreduce Platform

MapReduce processing model

Page 21: Introduction to Hadoop-Mapreduce Platform

How the final multi-node cluster will look

Page 22: Introduction to Hadoop-Mapreduce Platform

Who uses Hadoop?

• Amazon/A9
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!

Page 23: Introduction to Hadoop-Mapreduce Platform

Data type: key-value records

Map function: (Kin, Vin) -> list(Kinter, Vinter)

Reduce function: (Kinter, list(Vinter)) -> list(Kout, Vout)

MapReduce Programming Model

Page 24: Introduction to Hadoop-Mapreduce Platform

def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))

Example: Word Count
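The two functions above can be exercised with a minimal single-process simulation of the map -> shuffle -> reduce pipeline. This is a sketch of the execution model, not Hadoop's actual API; `map_reduce` stands in for the framework, and `yield` replaces the slide's `output()`.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Run mapper over each input record, group intermediate values
    by key (the 'shuffle'), then run reducer once per key."""
    intermediate = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            intermediate[key].append(value)
    return {key: reducer(key, values)
            for key, values in sorted(intermediate.items())}

def mapper(line):
    for word in line.split():
        yield (word, 1)

def reducer(key, values):
    return sum(values)

lines = ["the quick brown fox", "the lazy dog", "the fox"]
print(map_reduce(lines, mapper, reducer))
# {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

In a real cluster the shuffle step is distributed: each reducer pulls its share of the sorted intermediate data from every mapper's local disk.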

Page 25: Introduction to Hadoop-Mapreduce Platform
Page 26: Introduction to Hadoop-Mapreduce Platform

Single master controls job execution on multiple slaves

Mappers preferentially placed on same node or same rack as their input block
– Minimizes network usage

Mappers save outputs to local disk before serving them to reducers
– Allows recovery if a reducer crashes
– Allows having more reducers than nodes

MapReduce Execution Details

Page 27: Introduction to Hadoop-Mapreduce Platform

1. If a task crashes: retry on another node
• OK for a map because it has no dependencies
• OK for a reduce because map outputs are on disk

If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)

Fault Tolerance in MapReduce

Page 28: Introduction to Hadoop-Mapreduce Platform

2. If a node crashes:
• Re-launch its current tasks on other nodes
• Re-run any maps the node previously ran
– Necessary because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

Page 29: Introduction to Hadoop-Mapreduce Platform

3. If a task is going slowly (straggler):
• Launch a second copy of the task on another node ("speculative execution")
• Take the output of whichever copy finishes first, and kill the other

Surprisingly important in large clusters
– Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc.
– A single straggler may noticeably slow down a job

Fault Tolerance in MapReduce

Page 30: Introduction to Hadoop-Mapreduce Platform

By providing a data-parallel programming model, MapReduce can control job execution in useful ways:

– Automatic division of a job into tasks
– Automatic placement of computation near data
– Automatic load balancing
– Recovery from failures & stragglers

Takeaways

Page 31: Introduction to Hadoop-Mapreduce Platform

1. Search

Input: (lineNumber, line) records
Output: lines matching a given pattern

Map:
    if line matches pattern:
        output(line)

Reduce: identity function
Alternative: no reducer (map-only job)

Some practical MapReduce examples

Page 32: Introduction to Hadoop-Mapreduce Platform

2. Sort

Input: (key, value) records
Output: same records, sorted by key

Map: identity function
Reduce: identity function

Trick: pick a partitioning function h such that k1 < k2 => h(k1) < h(k2)

Some practical MapReduce examples
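The partitioning trick for sort can be illustrated in a few lines: if h preserves key order, then concatenating each reducer's (locally sorted) output yields a globally sorted result. The split points below are hypothetical, chosen just for the example; a real job would sample the input to pick balanced boundaries.

```python
# Sketch: an order-preserving ("total order") partitioner.
# Keys k1 < k2 always map to partitions h(k1) <= h(k2), so
# concatenating each reducer's sorted output is globally sorted.

import bisect

BOUNDARIES = ["g", "p"]  # hypothetical split points for 3 reducers

def h(key):
    return bisect.bisect_left(BOUNDARIES, key)  # partition index 0..2

keys = ["zebra", "apple", "mango", "kiwi", "banana", "pear"]
partitions = [[] for _ in range(3)]
for k in keys:
    partitions[h(k)].append(k)  # the shuffle sends k to reducer h(k)

# Each "reducer" sorts locally; concatenation is globally sorted.
result = [k for p in partitions for k in sorted(p)]
print(result)
# ['apple', 'banana', 'kiwi', 'mango', 'pear', 'zebra']
```

By contrast, the default hash partitioner balances load but scatters the key order, so each reducer's output would be sorted only within itself.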

Page 33: Introduction to Hadoop-Mapreduce Platform

3. Inverted Index

Input: (filename, text) records
Output: list of files containing each word

Map:
    for each word in text.split():
        output(word, filename)

Combine: unique file names for each word

Reduce:
    def reduce(word, filenames):
        output(word, sort(filenames))

Some practical MapReduce examples
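A minimal self-contained sketch of the map, combine, and reduce steps above, collapsed into one single-process function (illustrative only; the function name is ours, not Hadoop's):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """docs: dict of filename -> text.
    Returns word -> sorted list of files containing it."""
    postings = defaultdict(set)  # set = the combine step: unique files per word
    for filename, text in docs.items():
        for word in text.split():          # map: output(word, filename)
            postings[word].add(filename)
    # reduce: output(word, sort(filenames))
    return {word: sorted(files) for word, files in postings.items()}

docs = {"a.txt": "hadoop stores data", "b.txt": "mapreduce processes data"}
print(build_inverted_index(docs)["data"])
# ['a.txt', 'b.txt']
```

The combine step matters in the distributed setting: deduplicating file names on the mapper side shrinks the intermediate data shipped across the network during the shuffle.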

Page 34: Introduction to Hadoop-Mapreduce Platform

Inverted Index Example

Page 35: Introduction to Hadoop-Mapreduce Platform

4. Most Popular Words

Input: (filename, text) records
Output: top 100 words occurring in the most files

Two-stage solution:

Job 1:
– Create inverted index, giving (word, list(file)) records

Job 2:
– Map each (word, list(file)) to (count, word)
– Sort these records by count as in the sort job

Some practical MapReduce examples
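The two-job chain can be sketched end-to-end in a single process. This is an illustration of the dataflow, not a distributed implementation; `top_n=3` stands in for the 100 in the slide just to keep the example small.

```python
from collections import defaultdict

def job1_inverted_index(docs):
    # Job 1: (filename, text) records -> (word, list(file)) records
    postings = defaultdict(set)
    for filename, text in docs.items():
        for word in text.split():
            postings[word].add(filename)
    return postings

def job2_top_words(postings, top_n):
    # Job 2: map (word, list(file)) -> (count, word), then sort by count
    counted = [(len(files), word) for word, files in postings.items()]
    counted.sort(reverse=True)
    return [word for count, word in counted[:top_n]]

docs = {
    "a.txt": "hadoop mapreduce hdfs",
    "b.txt": "hadoop mapreduce",
    "c.txt": "hadoop",
}
print(job2_top_words(job1_inverted_index(docs), top_n=3))
# ['hadoop', 'mapreduce', 'hdfs']
```

Chaining jobs like this, where the output records of one job become the input records of the next, is the standard way to express multi-stage computations in plain MapReduce.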

Page 36: Introduction to Hadoop-Mapreduce Platform

Three ways to write jobs in Hadoop:
– Java API
– Hadoop Streaming (for Python, Perl, etc.)
– Pipes API (C++)

MapReduce in Hadoop
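With Hadoop Streaming, the mapper and reducer are ordinary executables that read lines on stdin and write tab-separated key/value pairs on stdout; the framework sorts the mapper output by key before the reducer sees it. A word-count pair in that style (a sketch, written as functions over file-like streams so it can be tested locally; a real streaming job would just read `sys.stdin` directly):

```python
import sys

def streaming_mapper(stdin, stdout):
    # Emit "word<TAB>1" for every word on every input line.
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word}\t1\n")

def streaming_reducer(stdin, stdout):
    # Input arrives sorted by key, so equal keys are adjacent:
    # sum a run of counts, emit the total when the key changes.
    current, total = None, 0
    for line in stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                stdout.write(f"{current}\t{total}\n")
            current, total = word, 0
        total += int(count)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")

if __name__ == "__main__":
    # Run as either "mapper.py" or "reducer.py" depending on an argument.
    if sys.argv[1:] == ["reduce"]:
        streaming_reducer(sys.stdin, sys.stdout)
    else:
        streaming_mapper(sys.stdin, sys.stdout)
```

Locally this behaves like the shell pipeline `cat input | mapper | sort | reducer`, which is a handy way to debug streaming jobs before submitting them to a cluster.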

Page 37: Introduction to Hadoop-Mapreduce Platform

MapReduce architecture

Page 38: Introduction to Hadoop-Mapreduce Platform
Page 39: Introduction to Hadoop-Mapreduce Platform

Scope of MapReduce

Page 40: Introduction to Hadoop-Mapreduce Platform

http://developer.yahoo.com/hadoop/tutorial/module3.html

http://v-lad.org/Tutorials/Hadoop/00%20-%20Intro.html

Hadoop-Mapreduce Tutorial

Page 41: Introduction to Hadoop-Mapreduce Platform

We introduced the MapReduce programming model for processing large-scale data.

We discussed the supporting Hadoop Distributed File System.

The concepts were illustrated using a simple example.

We reviewed some important parts of the source code for the example.

Summary

Page 42: Introduction to Hadoop-Mapreduce Platform

HDFS is not a POSIX file system; the Gfarm file system can be used in its place. Gfarm has the advantage that it supports not only MapReduce applications but also POSIX and MPI-IO applications.

Ref articles:

a) Hadoop MapReduce on Gfarm File System
Download: www.hpcs.cs.tsukuba.ac.jp/~mikami/publications/pragma18.pdf

b) Using the Gfarm File System as a POSIX compatible storage platform for Hadoop MapReduce applications
Download: www.shun0102.net/wp-content/uploads/PID2037887.pdf

Gfarm file system for POSIX & MPI-IO support

Page 43: Introduction to Hadoop-Mapreduce Platform
Page 44: Introduction to Hadoop-Mapreduce Platform
Page 45: Introduction to Hadoop-Mapreduce Platform

Thank You