
Introduction to Hadoop-MapReduce Platform

Presented by:
Monzur Morshed, Habibur Rahman

TigerHATS - www.tigerhats.org

The international research group dedicated to theories, simulation and modeling, new approaches, applications, experiences, development, evaluations, education, and human, cultural and industrial technology

 

TigerHATS - Information is power

Hadoop

Hadoop is an open source implementation of the MapReduce platform and distributed file system, written in Java. This module explains the basics of how to begin using Hadoop to experiment and learn from the rest of this tutorial. It covers setting up the platform and connecting other tools to use it.

Source: http://developer.yahoo.com/hadoop/tutorial/module3.html

What Hadoop is

• Inspired by Google
• Distributed file system similar to the Google File System
• Parallel programming model similar to Google MapReduce
• Parallel database similar to Google Bigtable
• Open source Java project

Hadoop was created by Doug Cutting, who named it after his son's toy elephant. It was originally developed to support distribution for the Nutch search engine project.

Hadoop

• Distributed file system (HDFS)

• Distributed execution framework (MapReduce)

• Query language (Pig)

• Distributed, column-oriented data store (HBase)

• Machine learning (Mahout)

Hadoop Distributed File System

• Cluster file system
• Designed for huge files (many GBs)
• Designed for lots of streaming reads and infrequent writes
• Not a POSIX file system: requires client help

What Hadoop isn't

• Hadoop is not a "classical" grid solution
• HDFS is not a POSIX file system
• HDFS is not designed for low-latency access to a huge number of small files
• Hadoop MapReduce is not designed for interactive applications
• HBase is not a relational database and does not have transactions or SQL support
• HDFS and HBase are not focused on security, encryption or multi-tenancy

HDFS, MapReduce

Typical Hadoop Cluster

Commodity hardware, typically in a two-level architecture:
– Nodes are commodity PCs
– 30-40 nodes/rack
– Uplink from the rack is 3-4 gigabit
– Rack-internal is 1 gigabit

HDFS Architecture

[Diagram: a Client asks the NameNode for a (1) filename, gets back (2) the block ids and the DataNodes holding them, then (3) reads the data directly from those DataNodes. The NameNode, assisted by the SecondaryNameNode, also tracks cluster membership.]

NameNode: maps a file to a file-id and a list of DataNodes
DataNode: maps a block-id to a physical location on disk
SecondaryNameNode: performs a periodic merge of the transaction log

Data Flow

[Diagram: log data flows from the Web Servers through Scribe Servers and Network Storage into the Hadoop Cluster, and results are loaded into Oracle RAC and MySQL.]

Image Source: http://wiki.apache.org/hadoop-data/attachments/HadoopPresentations/attachments/aw-apachecon-eu-2009.pdf

Very large distributed file system
– 10K nodes, 100 million files, 10 PB

Assumes commodity hardware
– Files are replicated to handle hardware failure
– Detects failures and recovers from them

Optimized for batch processing
– Data locations are exposed so that computations can move to where the data resides
– Provides very high aggregate bandwidth

User space; runs on heterogeneous OSes

HDFS - Hadoop Distributed File System

Data coherency
– Write-once-read-many access model
– Clients can only append to existing files

Files are broken up into blocks
– Typically 128 MB block size
– Each block replicated on multiple DataNodes

Intelligent client
– The client can find the location of blocks
– The client accesses data directly from the DataNode

Distributed File System

Simple data-parallel programming model designed for scalability and fault-tolerance

Framework for distributed processing of large data sets

Originally designed by Google

Pluggable user code runs in generic framework

Pioneered by Google - processes 20 petabytes of data per day

MapReduce Paradigm

At Google:
– Index construction for Google Search
– Article clustering for Google News
– Statistical machine translation

At Yahoo!:
– "Web map" powering Yahoo! Search
– Spam detection for Yahoo! Mail

At Facebook:
– Data mining
– Ad optimization
– Spam detection

What is MapReduce used for?

In research:
– Astronomical image analysis (Washington)
– Bioinformatics (Maryland)
– Analyzing Wikipedia conflicts (PARC)
– Natural language processing (CMU)
– Particle physics (Nebraska)
– Ocean climate simulation (Washington)

What is MapReduce used for?

MapReduce processing model

What the final multi-node cluster will look like

Who uses Hadoop?

• Amazon/A9
• Facebook
• Google
• IBM
• Joost
• Last.fm
• New York Times
• PowerSet
• Veoh
• Yahoo!

Data type: key-value records

Map function: (K_in, V_in) -> list(K_inter, V_inter)

Reduce function: (K_inter, list(V_inter)) -> list(K_out, V_out)

MapReduce Programming Model

def mapper(line):
    for word in line.split():
        output(word, 1)

def reducer(key, values):
    output(key, sum(values))

Example: Word Count
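The output call above is framework pseudocode. As a sanity check outside Hadoop, here is a minimal self-contained Python sketch that simulates the map, shuffle, and reduce phases of this word count in memory; the run_wordcount driver and the dict-based shuffle are stand-ins for what the framework does, not Hadoop APIs.

from collections import defaultdict

# Map phase: emit (word, 1) for every word in every input line.
def mapper(line):
    for word in line.split():
        yield (word, 1)

# Reduce phase: sum the counts collected for each word.
def reducer(key, values):
    return (key, sum(values))

def run_wordcount(lines):
    # Shuffle: group all intermediate values by key, mimicking
    # what the MapReduce framework does between the two phases.
    groups = defaultdict(list)
    for line in lines:
        for key, value in mapper(line):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

print(run_wordcount(["the quick brown fox", "the lazy dog"]))
# [('brown', 1), ('dog', 1), ('fox', 1), ('lazy', 1), ('quick', 1), ('the', 2)]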

Single master controls job execution on multiple slaves

Mappers are preferentially placed on the same node or the same rack as their input block
– Minimizes network usage

Mappers save their outputs to local disk before serving them to reducers
– Allows recovery if a reducer crashes
– Allows having more reducers than nodes

MapReduce Execution Details

1. If a task crashes: retry on another node
• OK for a map because it has no dependencies
• OK for a reduce because the map outputs are on disk
• If the same task fails repeatedly, fail the job or ignore that input block (user-controlled)

Fault Tolerance in MapReduce

2. If a node crashes:
• Re-launch its current tasks on other nodes
• Re-run any maps the node previously ran, because their output files were lost along with the crashed node

Fault Tolerance in MapReduce

3. If a task is going slowly (straggler):
• Launch a second copy of the task on another node ("speculative execution")
• Take the output of whichever copy finishes first, and kill the other

Surprisingly important in large clusters
– Stragglers occur frequently due to failing hardware, software bugs, misconfiguration, etc.
– A single straggler may noticeably slow down a job

Fault Tolerance in MapReduce

By providing a data-parallel programming model, MapReduce can control job execution in useful ways:

• Automatic division of a job into tasks
• Automatic placement of computation near data
• Automatic load balancing
• Recovery from failures and stragglers

Takeaways

1. Search

Input: (lineNumber, line) records
Output: lines matching a given pattern

Map:
  if line matches pattern:
    output(line)

Reduce: identity function
Alternative: no reducer (map-only job)

Some practical MapReduce examples
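A hedged sketch of the map side in plain Python, written as a Hadoop-Streaming-style filter over stdin; the "error" pattern is an illustrative assumption, not part of the original example.

import re
import sys

PATTERN = re.compile(r"error")  # hypothetical pattern to search for

# Map-only search: emit every input line that matches the pattern.
# In Hadoop Streaming the input split arrives line by line on stdin,
# and whatever is written to stdout becomes the job's output.
for line in sys.stdin:
    if PATTERN.search(line):
        sys.stdout.write(line)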

2. Sort

Input: (key, value) records
Output: the same records, sorted by key

Map: identity function
Reduce: identity function

Trick: pick a partitioning function h such that k1 < k2 => h(k1) < h(k2)

Some practical MapReduce examples
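To make the trick concrete, here is a small order-preserving (range) partitioner sketch, assuming lowercase string keys and a hypothetical num_partitions; because the partition index never decreases as keys grow, concatenating the per-reducer sorted outputs yields a globally sorted result.

# Range partitioner sketch: split the key space on the first character,
# so partition(k1) <= partition(k2) whenever k1 <= k2 (assuming
# lowercase alphabetic keys, an illustrative simplification).
def partition(key, num_partitions):
    first = key[0].lower()
    if not "a" <= first <= "z":
        return 0
    bucket = (ord(first) - ord("a")) * num_partitions // 26
    return min(bucket, num_partitions - 1)

# Example: with 4 reducers, "apple" -> 0, "melon" -> 1, "zebra" -> 3,
# so reducer 0's sorted output precedes reducer 1's, and so on.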

3. Inverted Index

Input: (filename, text) records
Output: list of files containing each word

Map:
  for each word in text.split():
    output(word, filename)

Combine: deduplicate the filenames emitted for each word

Reduce:
  def reduce(word, filenames):
    output(word, sort(filenames))

Some practical MapReduce examples

Inverted Index Example
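A compact in-memory sketch of the same pipeline in plain Python; the dict-based shuffle and the two sample documents are illustrative stand-ins, not Hadoop APIs.

from collections import defaultdict

# Map: emit (word, filename) for every word in the file's text.
def mapper(filename, text):
    for word in text.split():
        yield (word, filename)

# Reduce: deduplicate and sort the filenames seen for each word.
def reducer(word, filenames):
    return (word, sorted(set(filenames)))

def inverted_index(documents):
    groups = defaultdict(list)  # simulated shuffle: group by word
    for filename, text in documents.items():
        for word, fname in mapper(filename, text):
            groups[word].append(fname)
    return dict(reducer(w, fs) for w, fs in groups.items())

docs = {"a.txt": "to be or not to be", "b.txt": "to do"}
print(inverted_index(docs))
# {'to': ['a.txt', 'b.txt'], 'be': ['a.txt'], 'or': ['a.txt'], 'not': ['a.txt'], 'do': ['b.txt']}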

4. Most Popular Words

Input: (filename, text) records
Output: the top 100 words occurring in the most files

Two-stage solution:

Job 1:
– Create an inverted index, giving (word, list(file)) records

Job 2:
– Map each (word, list(file)) record to (count, word)
– Sort these records by count, as in the sort job

Some practical MapReduce examples
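A sketch of the second stage in plain Python, assuming Job 1's inverted index has already been collected into a dict; the top_words name and the tiny sample index are illustrative.

# Job 2 sketch: turn each (word, list(file)) record into (count, word),
# then keep the words appearing in the most files.
def top_words(index, n=100):
    counted = [(len(files), word) for word, files in index.items()]
    counted.sort(reverse=True)  # descending by file count, as in the sort job
    return counted[:n]

# Hypothetical output of Job 1 (the inverted index):
index = {"to": ["a.txt", "b.txt"], "be": ["a.txt"], "do": ["b.txt"]}
print(top_words(index, n=2))  # [(2, 'to'), (1, 'do')]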

Three ways to write jobs in Hadoop:

– Java API
– Hadoop Streaming (for Python, Perl, etc.; see the sketch below)
– Pipes API (C++)

MapReduce in Hadoop
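As one example of the Streaming route, here is the word count as a single stand-alone Python script that can play either role; the wc.py file name is arbitrary, and the exact path of the streaming jar varies by installation.

#!/usr/bin/env python
# Hadoop Streaming sketch: run with mode "map" or "reduce".
# Streaming delivers input on stdin and expects tab-separated
# key/value pairs on stdout; the reducer sees its keys sorted.
import sys

def do_map():
    for line in sys.stdin:
        for word in line.split():
            print("%s\t%d" % (word, 1))

def do_reduce():
    # Keys arrive sorted, so each word forms one contiguous run.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

if __name__ == "__main__":
    do_map() if sys.argv[1] == "map" else do_reduce()

A launch looks roughly like: hadoop jar .../hadoop-streaming.jar -input in -output out -mapper "python wc.py map" -reducer "python wc.py reduce" -file wc.py (the jar path and options depend on the Hadoop version).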

MapReduce architecture

Scope of MapReduce

http://developer.yahoo.com/hadoop/tutorial/module3.html

http://v-lad.org/Tutorials/Hadoop/00%20-%20Intro.html

Hadoop-MapReduce Tutorial

We introduced the MapReduce programming model for processing large-scale data

We discussed the supporting Hadoop Distributed File System

The concepts were illustrated using a simple example

We reviewed some important parts of the source code for the example.

Summary

HDFS is not a POSIX file system, but the Gfarm file system can be used in its place. Gfarm has the advantage that it supports not only MapReduce applications but also POSIX and MPI-IO applications.

Ref Articles:

a) Hadoop MapReduce on Gfarm File System
Download: www.hpcs.cs.tsukuba.ac.jp/~mikami/publications/pragma18.pdf

b) Using the Gfarm File System as a POSIX compatible storage platform for Hadoop MapReduce applications

Download: www.shun0102.net/wp-content/uploads/PID2037887.pdf

Gfarm file system for POSIX & MPI-IO support

Thank You
