cloud computing mapreduce (2) keke chen. outline hadoop streaming example hadoop java api...

28
Cloud Computing Mapreduce (2) Keke Chen

Upload: linette-hollie-johnston

Post on 17-Jan-2018

218 views

Category:

Documents


0 download

DESCRIPTION

A nice book  Hadoop: The definitive Guide You can read it online from campus network - ohiolink  ebook center  safari online

TRANSCRIPT

Page 1: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Cloud ComputingMapreduce (2)

Keke Chen

Page 2: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Outline Hadoop streaming example Hadoop java API

Framework important APIs

Mini-project

Page 3: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

A nice book Hadoop: The definitive Guide

You can read it online from campus network- ohiolink ebook center safari online

Page 4: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Hadoop streaming Simple and powerful interface for

programming Application developers do not need to learn

hadoop java APIs Good for simple, adhoc tasks

Page 5: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Note: Map/Reduce uses the local linux file

system for processing and hosting temporary data

HDFS is used to host application data

HDFS

Node Local filesystem

Page 6: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Hadoop streamining http://hadoop.apache.org/common/docs/c

urrent/streaming.html /usr/local/hadoop/bin/hadoop jar \

/usr/local/hadoop/hadoop-streaming-1.0.3.jar \-input myInputDirs -output myOutputDir \-mapper myMapper -reducer myReducer

Reducer can be empty: -reducer None myMapper and myReducer can be any

executable Mapper/reducer will take stdin and output to

stdout Files in myInputDirs are fed into mapper as stdin Mapper’s output will be the input of reducer

Page 7: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Packaging files with job submission /usr/local/hadoop/bin/hadoop jar \

/usr/local/hadoop/hadoop-streaming-1.0.3.jar \ -input “/user/hadoop/inputdata” \ -output “/user/hadoop/outputdata” \ -mapper “python myPythonScript.py myDictionary.txt” \ -reducer “/bin/wc” \ -file myPythonScript.py \ -file myDictionary.txt -file is good for small files

Input parameterfor the script

Page 8: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Using hadoop library classeshadoop jar $HADOOP_HOME/hadoop-streaming.jar \ -D mapred.reduce.tasks=12 \ -input myInputDirs \ -output myOutputDir \ -mapper org.apache.hadoop.mapred.lib.IdentityMapper \ -reducer org.apache.hadoop.mapred.lib.IdentityReducer \ -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

Page 9: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Large files and archives Upload large files to HDFS first Use –files option in streaming, which will

download files to local working directory -files

hdfs://host:fs_port/user/testfile.txt#testlink -archives

hdfs://host:fs_port/user/testfile.jar#testlink Cache1.txt, cache2.txt are in testfile.jar Then, locally testlink/cache1.txt, textlink/cache2.txt

Page 10: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Wordcount Problem: counting frequencies of words

for a large document collection. Implement mapper and reducer

respectively, using python Some good python tutorials at

http://wiki.python.org/

Page 11: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Mapper.pyimport sys

for line in sys.stdin: line = line.strip()

words = line.split()for word in words:

print ‘%s\t1’ % (word)

Page 12: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Reducer.pyimport sys

word2count={}

for line in sys.stdin: line = line.strip() word, count = line.split(‘\t’, 1) try:

count = int(count)word2count[word] = word2count.get(word, 0)+ count

except ValueError: pass

for word in word2count:print ‘%s\t%s’% (word, word2count[word])

Page 13: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Running wordcount

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \

-mapper "python mapper.py" \ -reducer "python reducer.py" \ -input text -output output2 \ -file /localpath/mapper.py -file

/localpath/reducer.py

Page 14: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Running wordcounthadoop jar $HADOOP_HOME/hadoop-

streaming.jar \ -mapper "python mapper.py" \ -reducer "python reducer.py" \ -input text -output output2 \ -file mapper.py -file reducer.py \ -jobconf mapred.reduce.tasks=2 \ -jobconf mapred.map.tasks=4

Page 15: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

If mapper/reducer takes files as parameters

hadoop jar $HADOOP_HOME/hadoop-streaming.jar \

-mapper "python mapper.py" \ -reducer "python reducer.py myfile" \ -input text -output output2 \ -file /localpath/mapper.py -file

/localpath/reducer.py -file /localpath/myfile

Page 16: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Hadoop Java APIs hadoop.apache.org/common/docs/

current/api/ benefits

Jave code is more efficient than streaming More parameters for control and tuning Better for iterative MR programs

Page 17: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Important base classes Mapper<keyIn, valueIn, keyOut,

valueOut> Function map(Object, Writable, Context)

Reducer<keyIn, valueIn, keyOut, valueOut> Function reduce(WritableComparable,

Iterator, Context) Combiner Partitioner

Page 18: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

The frameworkpublic class Wordcount{

public static class MapClass extends Mapper<Object, Text, Text, LongWritable> {

public void setup(Mapper.Context context){…} public void map(Object key, Text value, Context context) throws IOException {…} }

public static class ReduceClass Reducer<Text, LongWritable, Text, LongWritable> { public void setup(Reducer.Context context){…}

public void reduce(Text key, Iterator<LongWritable> values, Context context) throws IOException{…}}

public static void main(String[] args) throws Exception{}}

Page 19: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

The wordcount example in java http://hadoop.apache.org/common/docs/

current/mapred_tutorial.html#Example%3A+WordCount+v1.0

Old/New framework Old framework for version prior to 0.20

Page 20: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Mapper of wordcount

public static class WCMapper extends Mapper<Object, Text, Text, IntWritable>{

private final static IntWritable one = new IntWritable(1); private Text word = new Text();

public void map(Object key, Text value, Context context ) throws IOException, InterruptedException { StringTokenizer itr = new StringTokenizer(value.toString()); while (itr.hasMoreTokens()) { word.set(itr.nextToken()); context.write(word, one); } } }

Page 21: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

WordCount Reducer public static class WCReducer extends Reducer<Text,IntWritable,Text,IntWritable> { private IntWritable result = new IntWritable();

public void reduce(Text key, Iterable<IntWritable> values, Context context ) throws IOException, InterruptedException { int sum = 0; for (IntWritable val : values) { sum += val.get(); } result.set(sum); context.write(key, result); } }

Page 22: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Function parameters Define map/reduce parameters

according to your application Have to use writable classes in

org.apache.hadoop.io E.g. Text, LongWritable, IntWritable etc.

Template parameters and the function parameters should be matched

Map’s output and reduce’s input parameters should be matched.

Page 23: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Configuring map/reduce Passing global parameter settings to

each map/reduce process In main function, set parameters in a

Configuration object Configuration conf = new Configuration(); Job job = new Job(conf, "cloudvista");

job.setJarByClass(Wordcount.class); job.setOutputKeyClass(Text.class); job.setOutputValueClass(LongWritable.class);

job.setMapperClass(WCMapper.class); //job.setCombinerClass(WCReducer.class); job.setReducerClass(WCReducer.class); //job.setPartitionerClass(WCPartitioner.class);

job.setNumReduceTasks(num_reduce); FileInputFormat.setInputPaths (job, input); FileOutputFormat.setOutputPath (job, new Path(output_path )); System.exit(job.waitForCompletion(true)?0:1);

Page 24: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

How to run your app1. Compile to jar file2. Command line hadoop jar your_jar your_parameters

Normally you need to pass in Number of reducers Input files Output directory Any other application specific parameters

Page 25: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Access Files in HDFS?Example: In map function

Public void setup(Mapper.Context context){ Configuration conf = context.getConfiguration(); string filename = conf.get(“yourfile");

Path p = new Path(filename); // Path is used for opening the file.

FileSystem fs = FileSystem.get(conf);//determines local or HDFS

FSInputStream file = fs.open(p);

while (file.available() > 0){…

} file.close();

}

Page 26: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Combiner Apply reduce function to the intermediate

results locally after the map generates the result

Map1key1

Key n

combine Key1, value1Key2, value2…Keyn, valueN

reduces

Map’s local

Page 27: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Partitioner If map’s output will generate N keys

(N>R, R:# of reduces) By default, N keys are randomly distributed

to R reduces You can use partitioner to define how the

keys are distributed to the reduces.

Page 28: Cloud Computing Mapreduce (2) Keke Chen. Outline  Hadoop streaming example  Hadoop java API Framework important APIs  Mini-project

Mini project 11. Learn to use HDFS2. Read and run wordcount example

http://hadoop.apache.org/common/docs/r0.20.2/mapred_tutorial.html

3. Write a MR program for inverted-index /user/hadoop/prj1.txt Implement two versions

Script/exe + streaming Hadoop Java API

The file has “docID \t docContent” per line Generating inverted index

Word \t a list of “DocID:position”