Hadoop MapReduce: Programmer's Perspective

HAMS Technologies (www.hams.co.in)

Upload: jun

Post on 22-Feb-2016


Page 2: Hadoop  MapReduce  Programmers perspective

Hadoop overview

» A framework that makes it easy to write and run applications that process vast amounts of data. Its ecosystem includes MapReduce, HDFS, Hive, HBase, and Pig.

» Yahoo is the biggest contributor. Other major contributors are Facebook, Google, and Amazon/A9.

» Here's what makes it especially useful:

• Scalable and reliable
• Easy to implement
• Efficient
• Many tools available
• Supports many well-known languages and scripts

Page 3: Hadoop  MapReduce  Programmers perspective

How Hadoop works

• MapReduce divides applications into small blocks of work.
• HDFS creates the desired number of replicas of each data block for reliability, placing them on compute nodes around the cluster.
• MapReduce can then process the data locally, followed by aggregation of the intermediate results.

Page 4: Hadoop  MapReduce  Programmers perspective

General flow in the MapReduce architecture

1. Create a clustered network.
2. Load the data into the cluster using Map (mapper task).
3. Fetch the data to be processed with the help of Map (mapper task).
4. Aggregate the result with Reduce (reducer task).

[Diagram: Local Data -> Map -> Partial Result-1, -2, -3 (three parallel branches) -> Reduce -> Aggregated Result]
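The flow above can be imitated on a single machine. The following is a minimal plain-Java sketch, not Hadoop code: each inner list stands in for one node's local data, `mapPartition` (a hypothetical name) plays the Map task that produces a partial result, and `reduce` combines the partial results. Summing integers is only an assumed example workload.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy, single-process imitation of the Map -> Reduce flow (not the Hadoop API).
class PartialSums {

    // "Map" step: runs once per partition and returns that partition's partial result.
    static long mapPartition(List<Integer> localData) {
        long partial = 0;
        for (int value : localData) {
            partial += value;
        }
        return partial;
    }

    // "Reduce" step: aggregates the partial results into the final result.
    static long reduce(List<Long> partialResults) {
        long total = 0;
        for (long partial : partialResults) {
            total += partial;
        }
        return total;
    }

    // Runs the whole flow over a list of partitions.
    static long run(List<List<Integer>> partitions) {
        List<Long> partials = new ArrayList<>();
        for (List<Integer> partition : partitions) {
            partials.add(mapPartition(partition));
        }
        return reduce(partials);
    }

    public static void main(String[] args) {
        List<List<Integer>> partitions = Arrays.asList(
                Arrays.asList(1, 2, 3),   // local data on node 1
                Arrays.asList(4, 5),      // local data on node 2
                Arrays.asList(6));        // local data on node 3
        System.out.println(run(partitions)); // prints 21
    }
}
```

In a real cluster each `mapPartition` call would run on the node that holds the data, and only the small partial results would travel over the network.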

Page 5: Hadoop  MapReduce  Programmers perspective

General attributes of the MapReduce architecture

1. Distributed file system (DFS)
2. Data locality
3. Data redundancy for fault tolerance
4. Map tasks are applied to partitioned data and scheduled so that their input blocks are on the same machine.
5. Reduce tasks are applied to the data partitioned by the Map tasks.

[Diagram: Local Data -> Map -> Partial Result-1, -2, -3 (three parallel branches) -> Reduce -> Aggregated Result]
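How Map output is partitioned among Reduce tasks can be sketched in a few lines. Hadoop's default `HashPartitioner` assigns a key to `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the plain-Java sketch below reproduces that formula outside Hadoop. The class and method names are hypothetical, and string keys are an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of hash partitioning (same formula as Hadoop's default HashPartitioner).
class PartitionSketch {

    // Returns the index of the reduce task that receives this key.
    static int partitionFor(String key, int numReduceTasks) {
        // Masking with Integer.MAX_VALUE clears the sign bit so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Splits keys into one bucket per reduce task.
    static List<List<String>> partition(List<String> keys, int numReduceTasks) {
        List<List<String>> buckets = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            buckets.add(new ArrayList<>());
        }
        for (String key : keys) {
            buckets.get(partitionFor(key, numReduceTasks)).add(key);
        }
        return buckets;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("hadoop", "map", "reduce", "hadoop");
        System.out.println(partition(keys, 3));
    }
}
```

Because the bucket depends only on the key, every occurrence of the same key reaches the same Reduce task, whichever Map task emitted it.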

Page 6: Hadoop  MapReduce  Programmers perspective

Hadoop is an open-source implementation of the MapReduce architecture, maintained by Apache.

[Diagram: Hadoop architecture. Components: HDFS (Hadoop Distributed File System), MapReduce (job trackers), and Hive (Hadoop interactIVE). Master nodes: name node/s and job tracker node/s. Slave nodes: each Data Node runs data node/s and tracker node/s.]

Page 7: Hadoop  MapReduce  Programmers perspective

» Hadoop Streaming lets you create and run MapReduce jobs with any executable or script as the Mapper and/or the Reducer.

» HDFS (Hadoop Distributed File System) is a clustered file system used to store data. HDFS contains the logic to replicate and track the different data blocks. An HDFS write is shown below. Data is retrieved from HDFS in the same manner, in reverse.

[Diagram: HDFS write. A file, hams.txt, is split into Block-1, Block-2 and Block-3. The client asks the Name Node: "I have a file that contains 3 blocks. Where should I write them?" The Name Node answers: "Okay, write them on data nodes 1, 2 and 3." The blocks are then written to Data Node-1, Data Node-2 and Data Node-3, out of Data Node-1 to Data Node-n, each running data node/s and tracker node/s.]
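The exchange in the diagram can be imitated with a toy model. This is a plain-Java sketch, not the HDFS API and not HDFS's real placement policy: a hypothetical `ToyNameNode` hands out data nodes round-robin, places `replication` copies of each block on distinct nodes, and records where every block went so that the file can be read back.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of a name node tracking block locations (not the HDFS API).
class ToyNameNode {
    private final int dataNodeCount;
    private final int replication;
    private int next = 0; // index of the next data node to start from, round-robin
    // block name -> data node numbers (1-based) holding a replica
    private final Map<String, List<Integer>> blockMap = new LinkedHashMap<>();

    ToyNameNode(int dataNodeCount, int replication) {
        if (replication > dataNodeCount) {
            throw new IllegalArgumentException("not enough data nodes for the replication factor");
        }
        this.dataNodeCount = dataNodeCount;
        this.replication = replication;
    }

    // Answers "where should I write this block?" and remembers the answer.
    List<Integer> allocate(String blockName) {
        List<Integer> targets = new ArrayList<>();
        for (int i = 0; i < replication; i++) {
            targets.add((next + i) % dataNodeCount + 1);
        }
        next = (next + 1) % dataNodeCount;
        blockMap.put(blockName, targets);
        return targets;
    }

    // Answers "where can I read this block from?" (null if the block is unknown).
    List<Integer> locate(String blockName) {
        return blockMap.get(blockName);
    }

    public static void main(String[] args) {
        ToyNameNode nameNode = new ToyNameNode(3, 1);
        for (String block : new String[] {"Block-1", "Block-2", "Block-3"}) {
            System.out.println(block + " -> data node(s) " + nameNode.allocate(block));
        }
    }
}
```

Reading follows the reverse path: the client asks `locate` for each block of the file and fetches it from one of the returned data nodes.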

Page 8: Hadoop  MapReduce  Programmers perspective

When to use Hadoop

• Unstructured data for analysis

• Very large amounts of data

• Write once (or rarely), read many times

• Multiple modules written in different languages

Page 9: Hadoop  MapReduce  Programmers perspective

Kinds of people who develop applications using Hadoop

1. Hadoop admin/technical person: people who configure the Hadoop environment, setting up the required number of clusters with the details of all data sources and the different nodes.

2. Hadoop programmer: people who write the map and reduce functions that perform the data analysis.

*Here we take the perspective of the Hadoop programmer.

Page 10: Hadoop  MapReduce  Programmers perspective

Map/Reduce is a programming model for efficient distributed computing. It works like a Unix pipeline:

Unix -> cat input | grep | sort | uniq -c | cat > output
Hadoop -> Input | Map | Shuffle & Sort | Reduce | Output

A simple model, but good for a lot of applications:

• Log processing
• Web index building
• Count of URL access frequency
• Reverse web-link graph: list of all source URLs associated with a given target URL
• Inverted index: produces <word, list(Document ID)> pairs
• Distributed sort
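The Input | Map | Shuffle & Sort | Reduce | Output stages can be followed on one of the listed applications, the inverted index. The sketch below is plain Java that runs in a single process, not Hadoop code. The class and method names are hypothetical, and documents are assumed to be given as a map from document ID to text.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

// Single-process walk through Map | Shuffle & Sort | Reduce for an inverted index.
class InvertedIndexSketch {

    // Map: emit one <word, docId> pair per word occurrence.
    static List<Map.Entry<String, String>> map(String docId, String text) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String word : text.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new SimpleEntry<>(word, docId));
            }
        }
        return pairs;
    }

    // Shuffle & Sort: group values by key, with keys in sorted order.
    static TreeMap<String, List<String>> shuffle(List<Map.Entry<String, String>> pairs) {
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce: collapse each word's document IDs into a sorted list without duplicates.
    static List<String> reduce(List<String> docIds) {
        return new ArrayList<>(new TreeSet<>(docIds));
    }

    // Runs all stages and returns <word, list(Document ID)>.
    static Map<String, List<String>> build(Map<String, String> documents) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (Map.Entry<String, String> doc : documents.entrySet()) {
            pairs.addAll(map(doc.getKey(), doc.getValue()));
        }
        Map<String, List<String>> index = new TreeMap<>();
        for (Map.Entry<String, List<String>> group : shuffle(pairs).entrySet()) {
            index.put(group.getKey(), reduce(group.getValue()));
        }
        return index;
    }

    public static void main(String[] args) {
        Map<String, String> documents = new TreeMap<>();
        documents.put("doc1", "hadoop map reduce");
        documents.put("doc2", "hadoop hdfs");
        System.out.println(build(documents));
        // prints {hadoop=[doc1, doc2], hdfs=[doc2], map=[doc1], reduce=[doc1]}
    }
}
```

The `shuffle` step plays the role that `sort` plays in the Unix pipeline: it brings equal keys together so that the next stage can process each group in one pass.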

Page 11: Hadoop  MapReduce  Programmers perspective

Page 12: Hadoop  MapReduce  Programmers perspective

Here we need to implement the Map and Reduce functions and write the code that launches the application.

Mapper
• Input: value: lines of text of input
• Output: key: word, value: 1

Reducer
• Input: key: word, value: set of counts
• Output: key: word, value: sum

Launching program
• Defines the job
• Submits the job to the cluster
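Before running on a cluster, the Mapper and Reducer contracts above can be checked on one machine. The sketch below is plain Java, not the Hadoop API. The names `map`, `reduce` and `wordCount` are hypothetical stand-ins that follow the input and output described above, and splitting on whitespace is an assumption about what counts as a word.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Single-process imitation of the word count contracts (not the Hadoop API).
class WordCountSketch {

    // Mapper contract: one line of text in, one word (with an implied count of 1) out per token.
    static List<String> map(String line) {
        List<String> words = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    // Reducer contract: the set of counts for one word in, their sum out.
    static int reduce(List<Integer> counts) {
        int sum = 0;
        for (int count : counts) {
            sum += count;
        }
        return sum;
    }

    // Launcher stand-in: feeds lines to map, groups the 1s by word, then reduces each group.
    static Map<String, Integer> wordCount(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (String word : map(line)) {
                grouped.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> entry : grouped.entrySet()) {
            result.put(entry.getKey(), reduce(entry.getValue()));
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("to be or not");
        lines.add("to be");
        System.out.println(wordCount(lines)); // prints {be=2, not=1, or=1, to=2}
    }
}
```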

Page 13: Hadoop  MapReduce  Programmers perspective

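Mapper (example for word count)

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public static class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        // Called once per input line; emits <word, 1> for every token.
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // Split the line into words on whitespace.
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
                word.set(tokenizer.nextToken());
                context.write(word, one);
            }
        }
    }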

Page 14: Hadoop  MapReduce  Programmers perspective

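Reducer (example for word count)

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
        private IntWritable result = new IntWritable();

        // Called once per word; sums all the counts emitted for it.
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }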

Page 15: Hadoop  MapReduce  Programmers perspective

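Map reduce launcher

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    // Define the job.
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "wordcount");
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    job.setMapperClass(WordCountMap.class);
    job.setReducerClass(Reduce.class);
    job.setInputFormatClass(TextInputFormat.class);
    job.setOutputFormatClass(TextOutputFormat.class);
    FileInputFormat.addInputPath(job, new Path(args[1]));   // input path
    FileOutputFormat.setOutputPath(job, new Path(args[2])); // output path

    // Submit the job to the cluster and wait for it to finish.
    job.waitForCompletion(true);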

Page 16: Hadoop  MapReduce  Programmers perspective

Running the complete program

• Build the jar file, either directly in Eclipse or with the jar command.

• Configure Hadoop.

• Place the jar file in an appropriate location.

• Let's move to the demo :)

Page 17: Hadoop  MapReduce  Programmers perspective

Documentation:

• Hadoop Wiki
– Introduction: http://hadoop.apache.org/core/
– Getting Started: http://wiki.apache.org/hadoop/GettingStartedWithHadoop
– Map/Reduce Overview: http://wiki.apache.org/hadoop/HadoopMapReduce
– DFS: http://hadoop.apache.org/core/docs/current/hdfs_design.html

• Javadoc
– http://hadoop.apache.org/core/docs/current/api/index.html

Page 18: Hadoop  MapReduce  Programmers perspective

Thank you

Kindly drop us a mail with any suggestions or requests for clarification. We would like to hear from you.

HAMS Technologies (www.hams.co.in)