Team3: Xiaokui Shu, Ron Cohen, CS5604 at Virginia Tech, December 6, 2010












Team3: Xiaokui Shu, Ron Cohen (roncohen@vt.edu)
CS5604 at Virginia Tech
December 6, 2010


Introduction
  ▪ Hadoop
  ▪ MapReduce
Working With Hadoop
  ▪ Environment
  ▪ MapReduce Programming


Introduction :: Hadoop

Is a software framework
  ▪ The user writes the program
  ▪ Like a super-library
For distributed applications
  ▪ Built-in solutions
  ▪ Solutions depend on this framework
Inspired by Google's MapReduce and Google File System (GFS) papers

Introduction :: Hadoop

Who uses Hadoop
  ▪ Amazon
      ▪ Amazon's product search indices
  ▪ Adobe
      ▪ 30 nodes running HDFS, Hadoop and HBase
  ▪ Baidu
      ▪ Handles about 3000 TB per week
  ▪ Facebook
      ▪ Stores copies of internal log and dimension data sources
  ▪ LinkedIn, IBM, Yahoo!, Google…

Introduction :: Hadoop

Hadoop Common
HDFS
MapReduce
ZooKeeper

Introduction :: Hadoop :: IR

Connections to the IR book
  ▪ Ch. 4 Index construction
      ▪ Distributed indexing (4.4)
  ▪ Ch. 20 Web crawling and indexes
      ▪ Distributed crawler (20.2)
      ▪ Distributed indexing (20.3)

Introduction :: MapReduce
Is a software framework
For distributed computing
  ▪ Massive amounts of data
  ▪ Simple processing requirements
  ▪ Portability across a variety of platforms
      ▪ Clusters
      ▪ CMP/SMP
      ▪ GPGPU
Introduced by Google

Introduction :: MapReduce

Cited from MapReduce: Simplified Data Processing on Large Clusters

Introduction :: MapReduce
Map
  ▪ Map(k1, v1) -> list(k2, v2)
Reduce
  ▪ Reduce(k2, list(v2)) -> list(v3)
Hadoop MapReduce
  ▪ (input) <k1, v1> -> map -> <k2, v2> -> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
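The signatures above can be read as a small typed pipeline. The following is a minimal plain-Java sketch of that reading, not Hadoop code: the names MiniMapReduce, MapFn, ReduceFn, run and countWords are hypothetical, everything runs in one process, and the combine step is left out for brevity.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MiniMapReduce {
    /** Map(k1, v1) -> list(k2, v2) */
    interface MapFn<K1, V1, K2, V2> {
        List<Map.Entry<K2, V2>> map(K1 key, V1 value);
    }

    /** Reduce(k2, list(v2)) -> list(v3) */
    interface ReduceFn<K2, V2, V3> {
        List<V3> reduce(K2 key, List<V2> values);
    }

    /** Runs map, groups the intermediate pairs by k2, then runs reduce once per key. */
    static <K1, V1, K2 extends Comparable<K2>, V2, V3> Map<K2, List<V3>> run(
            List<Map.Entry<K1, V1>> input,
            MapFn<K1, V1, K2, V2> mapFn,
            ReduceFn<K2, V2, V3> reduceFn) {
        // Map phase plus grouping: collect every v2 under its k2.
        Map<K2, List<V2>> groups = new TreeMap<>();
        for (Map.Entry<K1, V1> record : input) {
            for (Map.Entry<K2, V2> pair : mapFn.map(record.getKey(), record.getValue())) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        // Reduce phase: one call per distinct k2.
        Map<K2, List<V3>> output = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
            output.put(group.getKey(), reduceFn.reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    /** Example job: count the words of one document. */
    static Map<String, List<Integer>> countWords(String name, String text) {
        MapFn<String, String, String, Integer> tokenize = (docName, contents) -> {
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (String word : contents.trim().split("\\s+")) {
                pairs.add(new SimpleEntry<>(word, 1));
            }
            return pairs;
        };
        ReduceFn<String, Integer, Integer> sum = (word, ones) -> {
            int total = 0;
            for (int one : ones) {
                total += one;
            }
            return Collections.singletonList(total);
        };
        List<Map.Entry<String, String>> input = new ArrayList<>();
        input.add(new SimpleEntry<>(name, text));
        return run(input, tokenize, sum);
    }

    public static void main(String[] args) {
        System.out.println(countWords("doc", "a b a")); // prints {a=[2], b=[1]}
    }
}
```

In a real cluster the grouping step is where data moves between machines; here it is only a sorted in-memory map.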

Introduction :: MapReduce Ex
Source
  $ cat file01
  Hello World Bye World
  $ cat file02
  Hello Hadoop Goodbye Hadoop

Introduction :: MapReduce Ex
Map Output
  ▪ For file01
      <Hello, 1>
      <World, 1>
      <Bye, 1>
      <World, 1>
  ▪ For file02
      <Hello, 1>
      <Hadoop, 1>
      <Goodbye, 1>
      <Hadoop, 1>

Introduction :: MapReduce Ex
Reduce Output
  <Bye, 1>
  <Goodbye, 1>
  <Hadoop, 2>
  <Hello, 2>
  <World, 2>
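The example can be reproduced without a cluster. This is a rough plain-Java sketch, assuming the two file contents shown in the Source slide; the class WordCountSketch and its map and reduce helpers are hypothetical names, not the Hadoop API.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class WordCountSketch {
    /** Map step: emit <word, 1> for every whitespace-separated token. */
    static List<Map.Entry<String, Integer>> map(String contents) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : contents.trim().split("\\s+")) {
            pairs.add(new SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    /** Reduce step: sum the values per word; TreeMap keeps the keys sorted. */
    static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        intermediate.addAll(map("Hello World Bye World"));       // file01
        intermediate.addAll(map("Hello Hadoop Goodbye Hadoop")); // file02
        System.out.println(reduce(intermediate));
        // prints {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```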

Introduction :: MapReduce
More input
More mappers
  ▪ Combiner Function after Map
More reducers
  ▪ Partition Function before Reduce
Focus on Map & Reduce
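To make the two extra functions concrete, here is a small plain-Java sketch under stated assumptions: combine pre-sums the counts of one mapper's output before anything is sent on, and partition picks a reducer by hashing the key modulo the number of reducers. The class and method names are hypothetical, and the hash-modulo scheme is only one common choice of partition function.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinePartitionSketch {
    /** Combiner: sum counts locally on the mapper's side, so fewer pairs reach the reducers. */
    static Map<String, Integer> combine(List<String> mappedWords) {
        Map<String, Integer> local = new TreeMap<>();
        for (String word : mappedWords) {
            local.merge(word, 1, Integer::sum);
        }
        return local;
    }

    /** Partition function: map a key to a reducer index in [0, numReducers). */
    static int partition(String key, int numReducers) {
        // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        // Words emitted by the mapper for file01, each with an implicit count of 1.
        List<String> file01Words = Arrays.asList("Hello", "World", "Bye", "World");
        Map<String, Integer> combined = combine(file01Words);
        System.out.println(combined); // prints {Bye=1, Hello=1, World=2}
        for (String key : combined.keySet()) {
            System.out.println(key + " -> reducer " + partition(key, 2));
        }
    }
}
```

Because the same key always hashes to the same index, all counts for one word end up at one reducer, which is what lets reduce produce a single total per word.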

Working With Hadoop :: Env

Hadoop in Java (C++)
Run in 3 modes
  ▪ Local (Standalone) Mode
  ▪ Pseudo-Distributed Mode
  ▪ Fully-Distributed Mode
It is set up in Pseudo-Distributed Mode in our instance on the IBM cloud

Working With Hadoop

Process
  1. Start Hadoop service
  2. Prepare input
  3. Write your MapReduce program
  4. Compile your program
  5. Run your application with Hadoop

Working With Hadoop :: Env
Start Hadoop service
  $ bin/hadoop namenode -format
  $ bin/
Initialize filesystem
  $ bin/hadoop fs -put localdir hinputdir
  ▪ You can also use -get, -rm, -cat with fs

Working With Hadoop :: Env
Compile your program & create jar
  $ javac -classpath ${HADOOP}-core.jar -d wordcount_classes WordCount.java
  $ jar -cvf wordcount.jar -C wordcount_classes/ .
Run your application with Hadoop
  $ bin/hadoop jar wordcount.jar org.myorg.WordCount hinputdir houtputdir

Working With Hadoop :: Prog
void map(String name, String document):
  // name: document name
  // document: document contents
  for each word w in document:
    EmitIntermediate(w, "1");

void reduce(String word, Iterator partialCounts):
  // word: a word
  // partialCounts: a list of aggregated partial counts
  int result = 0;
  for each pc in partialCounts:
    result += ParseInt(pc);
  Emit(AsString(result));

Cited from Wikipedia

Working With Hadoop :: Prog
// Needs java.io.IOException, java.util.StringTokenizer,
// org.apache.hadoop.io.* and org.apache.hadoop.mapred.*
public static class Map extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, IntWritable> output,
                  Reporter reporter) throws IOException {
    // Emit <word, 1> for every token of the input line.
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      output.collect(word, one);
    }
  }
}



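Working With Hadoop :: Prog
// Needs java.io.IOException, java.util.Iterator,
// org.apache.hadoop.io.* and org.apache.hadoop.mapred.*
public static class Reduce extends MapReduceBase
    implements Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterator<IntWritable> values,
                     OutputCollector<Text, IntWritable> output,
                     Reporter reporter) throws IOException {
    // Sum all partial counts for this word and emit <word, total>.
    int sum = 0;
    while (values.hasNext()) {
      sum += values.next().get();
    }
    output.collect(key, new IntWritable(sum));
  }
}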


Working With Hadoop :: Prog
Configurations & Main class
Leave the other work to the Hadoop MapReduce framework


Hadoop
  ▪ Introduction
  ▪ Connections to the IR book
MapReduce
  ▪ Overview
  ▪ E.g. WordCount
Environment configuration
Writing your MapReduce application

References
  ▪ Hadoop Project
  ▪ MapReduce in Hadoop
  ▪ MapReduce: Simplified Data Processing on Large Clusters
  ▪ Hadoop Single-Node Setup
  ▪ Who uses Hadoop

Thank You!
