Google MapReduce Framework
A Summary of:
MapReduce
&
Hadoop API
Slides prepared by Peter Erickson ([email protected])
What is MapReduce?
Massively parallel processing of very large data sets (larger than 1TB).
Parallelize the computations over hundreds or thousands of CPUs.
Ensure fault tolerance.
Do this all through an easy-to-use, abstract and reusable framework.
What is MapReduce?
A simple algorithm that is easily parallelized and efficiently handles large sets of data:
MAP – Transform the data into (key, value) pairs
then
REDUCE – Perform a reduction across n maps (nodes)
- Simple, clean abstraction for programmers.
Implementation
Programmers need only to implement two functions:
map ( Object key, Object value )
-> Map<Object, Object>
reduce ( Object key, List<Object> values )
-> Map<Object, Object>
Let's look at an example to better understand these functions.
Word Count Example Program
Given a document of words, count the occurrences of every word, e.g. (the:42), (is:10), (a:23), (computer:2), etc.
How would you perform this task sequentially?
• Map<Word, Integer> maps words to the number of occurrences in the document.
How would you perform this task in parallel?
• Split the document amongst n processors, map words in the same way on each, then reduce across all nodes.
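The sequential version can be sketched in plain Java. This is a minimal illustration, not part of the Hadoop API; the class and method names are chosen here for the example:

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSeq {
    // count occurrences of each whitespace-separated word in a document
    public static Map<String, Integer> countWords(String document) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : document.split("\\s+")) {
            if (word.isEmpty()) continue;  // skip leading-whitespace artifacts
            counts.put(word, counts.getOrDefault(word, 0) + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "the" appears twice, "cat" twice, "is" once
        System.out.println(countWords("the cat is the cat"));
    }
}
```

The parallel version below splits this same loop across nodes and merges the per-node counts in a reduce step.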
Word Count: Map function
Map is performed on each node, using a subset of the overall data.
// input_key: document name
// input_value: document contents
Map<String, Integer> map( String input_key, String input_value ) {
// iterate over all words (.split( "\\s" ) splits the string on whitespace)
for ( String word : input_value.split( "\\s" ) ) {
// insert a 1 into the output map for this word (there will be collisions)
output.put( word, 1 );
}
}
Word Count: Map function
// input_key: document name
// input_value: document contents
map( String input_key, String input_value ) {
// iterate over all words (.split( "\\s" ) splits the string on whitespace)
for ( String word : input_value.split( "\\s" ) ) {
// insert a 1 into the output map for this word (there will be collisions)
output.put( word, 1 );
}
}
- Collisions in the map are IMPORTANT.
- Values with the same key are passed to the Reduce Function.
- The Reduce Function makes the decision of how to merge these
collisions.
Reduction Phase
Values in the map are merged in some way using the Reduce Function.
The Reduce Function can add, subtract, multiply, divide, take the average, ignore all data, or anything the programmer chooses.
reduce() is passed a key and a list of values, all of which share that key.
reduce() is expected to return a new Map with reduced values.
Reduction Phase
For example, if reduce() is passed:
Key (word): “the”
Values: { 1, 1, 1, 1, 1 }
it would emit a new summed (key, value) pair to the output map:
Key (word): “the”
Value: 5
Reduction Phase
Reduction can happen on any number of nodes before forming a final map:
Node 1 Node 2 Node 3
[“the”, {1, 1, 1}] [“the”, {1, 1, 1, 1}] [“the”, {1}]
------------------REDUCE-----------------
[“the”, {7}] [“the”, {1}]
---------------------REDUCE------------------
Final Result: [“the”, {8}]
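Reducing in rounds like this works because summation is associative: partial sums can themselves be summed. A plain-Java sketch of the tree above (the names are illustrative, not Hadoop API):

```java
import java.util.Arrays;
import java.util.List;

public class TreeReduce {
    // sum one list of collided values -- one reduce() call
    public static int reduceSum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // round 1: reduce nodes 1 and 2 (3 + 4 = 7)
        int nodes12 = reduceSum(Arrays.asList(
                reduceSum(Arrays.asList(1, 1, 1)),    // node 1
                reduceSum(Arrays.asList(1, 1, 1, 1))  // node 2
        ));
        // round 2: merge with node 3's single value (7 + 1 = 8)
        int result = reduceSum(Arrays.asList(nodes12, reduceSum(Arrays.asList(1))));
        System.out.println(result); // prints 8
    }
}
```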
Word Count: Reduce Function
// key: a word
// values: a list of collided values for this word
Map<String, Integer> reduce( String key, List<Integer> values ) {
// sum the counts for this word
int total = 0;
// iterate over all integers in the collided value list
for ( int count : values ) {
total += count;
}
output.put( key, total );
}
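Between map() and reduce(), the framework groups all collided values under their key before handing each group to the reducer. A minimal plain-Java simulation of that grouping (a sketch only; the real framework shuffles across nodes, and these class names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShuffleSketch {
    // map phase: emit (word, 1) for every word, keeping the collisions
    public static Map<String, List<Integer>> mapPhase(String document) {
        Map<String, List<Integer>> collided = new HashMap<>();
        for (String word : document.split("\\s+")) {
            collided.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return collided;
    }

    // reduce phase: sum the collided value list for each word
    public static Map<String, Integer> reducePhase(Map<String, List<Integer>> collided) {
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : collided.entrySet()) {
            int total = 0;
            for (int count : e.getValue()) {
                total += count;
            }
            result.put(e.getKey(), total);
        }
        return result;
    }

    public static void main(String[] args) {
        // "the" collides three times, so reduce() receives {1, 1, 1} for it
        System.out.println(reducePhase(mapPhase("the cat is the cat the")));
    }
}
```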
Hadoop: Java MapReduce Implementation
Hadoop is an open source project run by the Apache Foundation
Provides an API to write MapReduce programs easily and efficiently in Java
Installed on the Thug (Paranoia) cluster at RIT
Used in association with a distributed filesystem modeled on the Google File System
Hadoop has several frameworks for different parts of the MapReduce paradigm.
Hadoop API
A Summary of:
• Input Formats
• Record Readers
• Mapping Function
• Reducing Function
• Output Formats
• General Program Layout
Hadoop Input Formats
Most Hadoop programs read their input from a file.
The data from a Hadoop input file must be parsed into (key, value) pairs before it ever reaches the map() function.
The key and value types from the input file are separate from the key and value types used for map and reduce.
A very simple input format: TextInputFormat
• Key: Line number
• Value: A line of text from the input file
Hadoop Data Types
Hadoop uses special data types for serialization between nodes and for data comparison:
• IntWritable – integer type
• LongWritable – long integer type
• Text – string type
• many, many more
The WordCount program used the TextInputFormat, as we saw in the map() function:
public void map( LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter ) throws IOException {
Word Count Example Program
public void map( LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter ) throws IOException {
key = line number
value = text from the document at that line number
output = OutputCollector, a special Map that allows collisions
• output is written to the output collector
reporter = special Hadoop object for reporting progress
• doesn't need to be used – for more advanced programs
Hadoop Input Format
The InputFormat object of the program reads data on a single node from a file, or other source.
The InputFormat is then asked to partition the data into smaller sets of data for each node to process.
Let's look at a new sample program that does not read any input from a file:
• Sum all prime numbers between 1 and 1000.
Prime Number Example
We could make a file that lists all numbers from 1 to 1000, but this is unnecessary: we can write our own InputFormat class, PrimeInputFormat, to generate these numbers and divide them amongst the nodes.
The InputFormat should generate numbers 1 to 1000, then split the numbers into n groups.
InputFormats generate or read (key, value) pairs, so what should we use?
• We have no natural key, only values (numbers), so we use a dummy key.
InputFormat Interface
// interface to be used with objects of type K and V
public interface InputFormat<K, V> {
// validates the input for the specified job (can be ignored)
public void validateInput( JobConf job ) throws IOException;

// returns an array of "InputSplit" objects that are sent to each node;
// the number of splits to be made is represented by numSplits
public InputSplit[] getSplits( JobConf job, int numSplits )
throws IOException;

// returns an "iterator" of sorts for a node to extract (key, value) pairs
// from an InputSplit
public RecordReader<K, V> getRecordReader( InputSplit split,
JobConf job, Reporter reporter ) throws IOException;
}
InputSplit Interface
public interface InputSplit extends Writable {
// get the total bytes of data in this input split
public long getLength();

// get the hostnames of where the splits are (can be ignored)
public String[] getLocations();
}
// represents an object that can be serialized/deserialized
public interface Writable {
// read in the fields of this object from the DataInput object
public void readFields( DataInput in );

// write the fields of this object to the DataOutput object
public void write( DataOutput out );
}
Prime Number InputSplit: RangeSplit
An InputSplit for our program need only hold the range of numbers for each node (min, max).
public class RangeSplit implements InputSplit {
int min, max;
public RangeSplit() { super(); }
public long getLength() { return ( long )( max - min ); }
public String[] getLocations() { return new String[]{}; }
public void write( DataOutput out ) {
out.writeInt( min ); out.writeInt( max );
}
public void readFields( DataInput in ) {
min = in.readInt(); max = in.readInt();
}
}
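The write/readFields pair above is what lets Hadoop ship a split to another node as raw bytes. The round trip can be exercised with plain java.io streams; this sketch mirrors RangeSplit's field layout outside Hadoop (RangeSplitDemo and roundTrip are names invented for the example):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RangeSplitDemo {
    int min, max;

    // same field layout as the slides' RangeSplit.write()
    public void write(DataOutputStream out) throws IOException {
        out.writeInt(min);
        out.writeInt(max);
    }

    // same field layout as the slides' RangeSplit.readFields()
    public void readFields(DataInputStream in) throws IOException {
        min = in.readInt();
        max = in.readInt();
    }

    // serialize one split to bytes, then deserialize into a fresh object
    public static int[] roundTrip(int min, int max) {
        try {
            RangeSplitDemo original = new RangeSplitDemo();
            original.min = min;
            original.max = max;
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buf));

            RangeSplitDemo copy = new RangeSplitDemo();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray())));
            return new int[] { copy.min, copy.max };
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        int[] r = roundTrip(251, 500);
        System.out.println(r[0] + ".." + r[1]); // prints 251..500
    }
}
```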
PrimeInputFormat.getSplits();
Create RangeSplit objects for our program:
public InputSplit[] getSplits( JobConf job, int numSplits ) throws IOException {
RangeSplit[] splits = new RangeSplit[ numSplits ];
// for simplicity's sake, we assume 1000 is evenly divisible
// by numSplits, but this may not always be the case
int len = 1000 / numSplits;
for ( int i = 0; i < numSplits; i++ ) {
splits[ i ] = new RangeSplit();
splits[ i ].min = ( i * len ) + 1;
splits[ i ].max = ( i + 1 ) * len;
} // for
return splits;
} // getSplits
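The splitting arithmetic can be checked on its own, outside Hadoop. This standalone sketch (names invented for the example) computes the same (min, max) ranges and shows they tile 1..1000 with no gaps or overlaps:

```java
public class SplitRanges {
    // compute the (min, max) range for split i of numSplits over 1..total,
    // assuming total is evenly divisible by numSplits (as the slide does)
    public static int[] range(int i, int numSplits, int total) {
        int len = total / numSplits;
        return new int[] { i * len + 1, (i + 1) * len };
    }

    public static void main(String[] args) {
        // prints (1, 250) (251, 500) (501, 750) (751, 1000)
        for (int i = 0; i < 4; i++) {
            int[] r = range(i, 4, 1000);
            System.out.print("(" + r[0] + ", " + r[1] + ") ");
        }
        System.out.println();
    }
}
```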
Record Reader
An InputSplit for our program holds the range of numbers (min, max) for one node.
• i.e. for 4 nodes:
• (1, 250), (251, 500), (501, 750), (751, 1000)
RecordReader is then responsible for generating (key, value) pairs from an InputSplit.
Our RecordReader will iterate from min to max on each node.
One RecordReader is used per Mapper.
RecordReader Interface
// the record reader is responsible for iterating over (key, value) pairs in
// an input split
public interface RecordReader<K, V> {
// creates a key/value for the record reader (for generics)
public K createKey();
public V createValue();
// get the position of the iterator
public int getPos();
// get the progress of the iterator (0.0 to 1.0)
public float getProgress();
// populate key and value with the next tuple from the InputSplit
public void next( K key, V value );
// marks the end of use of the RecordReader
public void close();
} // RecordReader
PrimeInputFormat.getRecordReader();
// the record reader is responsible for iterating over (key, value) pairs in
// an input split
public RecordReader<Text, IntWritable> getRecordReader( InputSplit split,
JobConf conf, Reporter reporter ) throws IOException {
final RangeSplit range = ( RangeSplit )split;
// return a new anonymous inner class
return new RecordReader<Text, IntWritable>() {
int pos = 0;
public Text createKey() { return new Text(); }
public IntWritable createValue() { return new IntWritable(); }
public int getPos() { return pos; }
public float getProgress() {
return ( float )pos / ( float )( range.max - range.min );
}
// continued ...
PrimeInputFormat.getRecordReader();
// return the next key, value pair and increment the position
public void next( Text key, IntWritable value ) throws IOException {
// get the number at this position
int val = range.min + pos;
// dummy key value
key.set( "key" );
// set the number for the value
value.set( val );
// increment the position
pos++;
}
// close the RecordReader
public void close() { };
};
}
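Stripped of the Hadoop interfaces, the RecordReader above is just an iterator from min to max. A plain-Java sketch of that iteration (names invented for the example; unlike the slide's getProgress, this one divides by the count of values, max - min + 1, so progress reaches exactly 1.0):

```java
public class RangeReader {
    final int min, max;
    int pos = 0;

    public RangeReader(int min, int max) {
        this.min = min;
        this.max = max;
    }

    // true while there are still values left in the range
    public boolean hasNext() {
        return min + pos <= max;
    }

    // produce the next value and advance the position,
    // like next() populating the IntWritable
    public int next() {
        return min + pos++;
    }

    public float getProgress() {
        return (float) pos / (float) (max - min + 1);
    }

    // drain a whole split the way one Mapper's loop would
    public static int sumAll(int min, int max) {
        RangeReader r = new RangeReader(min, max);
        int sum = 0;
        while (r.hasNext()) {
            sum += r.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumAll(1, 250)); // prints 31375
    }
}
```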
Prime Number Mapper
Our program's Mapper now reads in a dummy key and a number. What should our new Map data types be?
• BooleanWritable = prime/not prime
• IntWritable = the number
Reducer can then add together all values with a “true” boolean key, and ignore all “false” values.
public static class PrimeMapper extends MapReduceBase
implements Mapper<Text, IntWritable, BooleanWritable, IntWritable>
// ^mapper input^ ^mapper output^
Prime Sum Mapper
public static class PrimeMapper extends MapReduceBase
implements Mapper<Text, IntWritable, BooleanWritable, IntWritable> {
public void map( Text key, IntWritable value,
OutputCollector<BooleanWritable, IntWritable> output,
Reporter reporter ) throws IOException {
// check if the number is prime – choose your favorite prime number
// testing algorithm
if ( isPrime( value.get() ) ) {
output.collect( new BooleanWritable( true ), value );
} else {
output.collect( new BooleanWritable( false ), value );
}
}
} // PrimeMapper
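The slide leaves isPrime() to the reader. One possible choice is trial division up to the square root, which is plenty fast for n up to 1000 (the class name here is invented for the example):

```java
public class PrimeTest {
    // trial division: n is prime iff no d in [2, sqrt(n)] divides it
    public static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrime(997)); // prints true
    }
}
```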
Prime Number Reducer
Reducer will take multiple (boolean, int) pairs, and reduce to a single (text, int) pair:
• “Sum”, int
public static class PrimeReducer extends MapReduceBase
implements Reducer<BooleanWritable, IntWritable, Text, IntWritable>
// ^reducer input^ ^reducer output^
Prime Sum Reducer
public static class PrimeReducer extends MapReduceBase
implements Reducer<BooleanWritable, IntWritable, Text, IntWritable> {
public void reduce( BooleanWritable key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter ) throws IOException {
// ignore the "false" values
if ( key.get() ) {
// sum the values and write to the output collector
int sum = 0;
while ( values.hasNext() ) {
sum += values.next().get();
}
output.collect( new Text( "Sum" ), new IntWritable( sum ) );
}
}
} // PrimeReducer
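The whole job can be simulated on one node in plain Java: "map" each number to (isPrime, n), then "reduce" by summing only the values keyed true. This sketch is an illustration of the data flow, not the Hadoop program itself (class and method names invented for the example):

```java
public class PrimeSumSketch {
    // same trial-division test the mapper would delegate to
    public static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // one-node simulation: the map phase tags each number prime/not prime,
    // and the reduce phase sums only the true-keyed values
    public static int primeSum(int min, int max) {
        int sum = 0;
        for (int n = min; n <= max; n++) {
            if (isPrime(n)) {       // the reducer ignores false keys
                sum += n;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(primeSum(1, 1000));
    }
}
```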
Output Formats
Hadoop uses an OutputFormat just like an InputFormat.
Easiest to use:
• TextOutputFormat
• Writes each key and value, tab-separated, one pair per line to an output file
• The file is written to the distributed filesystem (HDFS).
Program Layout
Our example Prime Sum program:
public static void main( String[] args ) {
JobConf job = new JobConf( PrimeSum.class );
job.setJobName( "primesum" );
job.setOutputPath( new Path( "primesum-output" ) );
job.setOutputKeyClass( Text.class );
job.setOutputValueClass( IntWritable.class );
job.setMapperClass( PrimeMapper.class );
job.setReducerClass( PrimeReducer.class );
job.setInputFormat( PrimeInputFormat.class );
job.setOutputFormat( TextOutputFormat.class );
Program Layout
Last but not least:
// run the job
JobClient.runJob( job );
} // main