Google MapReduce Framework
A Summary of:
MapReduce
&
Hadoop API
Slides prepared by Peter Erickson ([email protected])
What is MapReduce?
Massively parallel processing of very large data sets (larger than 1TB).
Parallelize the computations over hundreds or thousands of CPUs.
Ensure fault tolerance.
Do this all through an easy-to-use, abstract and reusable framework.
What is MapReduce?
A simple algorithm that is easily parallelized and efficiently handles large sets of data:
MAP – Transform the data into (key, value) pairs
then
REDUCE – Perform a reduction across n maps (nodes)
- Simple, clean abstraction for programmers.
Implementation
Programmers need only to implement two functions:
map ( Object key, Object value )
-> Map<Object, Object>
reduce ( Object key, List<Object> values )
-> Map<Object, Object>
Let's look at an example to better understand these functions.
Word Count Example Program
Given a document of words, count the occurrences of every word, e.g. (the:42), (is:10), (a:23), (computer:2), etc.
How would you perform this task sequentially?
• Map<Word, Integer> maps words to the number of occurrences in the document.
How would you perform this task in parallel?
• Split the document amongst n processors, map words in the same way on each, then reduce across all nodes.
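The sequential version can be sketched in plain Java. This is a minimal illustration, not part of the Hadoop API; the class and method names are chosen here for the example:

```java
import java.util.HashMap;
import java.util.Map;

public class WordCountSeq {
    // count occurrences of each whitespace-separated word in a document
    public static Map<String, Integer> countWords(String document) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : document.split("\\s+")) {
            if (word.isEmpty()) continue;  // skip leading-whitespace artifacts
            counts.put(word, counts.getOrDefault(word, 0) + 1);
        }
        return counts;
    }

    public static void main(String[] args) {
        // "the" appears twice, "cat" twice, "is" once
        System.out.println(countWords("the cat is the cat"));
    }
}
```

The parallel version below splits this same loop across nodes and merges the per-node counts in a reduce step.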
Word Count: Map function
Map is performed on each node, using a subset of the overall data.
// input_key: document name
// input_value: document contents
Map<String, Integer> map( String input_key, String input_value ) {
// iterate over all words (.split( "\\s" ) splits the string on whitespace)
for ( String word : input_value.split( "\\s" ) ) {
// insert a 1 into the output map for this word (there will be collisions)
output.put( word, 1 );
}
}
Word Count: Map function
// input_key: document name
// input_value: document contents
map( String input_key, String input_value ) {
// iterate over all words (.split( "\\s" ) splits the string on whitespace)
for ( String word : input_value.split( "\\s" ) ) {
// insert a 1 into the output map for this word (there will be collisions)
output.put( word, 1 );
}
}
- Collisions in the map are IMPORTANT.
- Values with the same key are passed to the Reduce Function.
- The Reduce Function makes the decision of how to merge these
collisions.
Reduction Phase
Values in the map are merged in some way using the Reduce Function.
The Reduce Function can add, subtract, multiply, divide, take the average, ignore all data, or anything the programmer chooses.
reduce() is passed a key and a list of values, all of which share that key.
reduce() is expected to return a new Map with reduced values.
Reduction Phase
For example, if reduce() is passed:
Key (word): “the”
Values: { 1, 1, 1, 1, 1 }
it would emit a new summed (key, value) pair to the output map:
Key (word): “the”
Value: 5
Reduction Phase
Reduction can happen on any number of nodes before forming a final map:
Node 1 Node 2 Node 3
[“the”, {1, 1, 1}] [“the”, {1, 1, 1, 1}] [“the”, {1}]
------------------REDUCE-----------------
[“the”, {7}] [“the”, {1}]
---------------------REDUCE------------------
Final Result: [“the”, {8}]
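Reducing in rounds like this works because summation is associative: partial sums can themselves be summed. A plain-Java sketch of the tree above (the names are illustrative, not Hadoop API):

```java
import java.util.Arrays;
import java.util.List;

public class TreeReduce {
    // sum one list of collided values -- one reduce() call
    public static int reduceSum(List<Integer> values) {
        int total = 0;
        for (int v : values) {
            total += v;
        }
        return total;
    }

    public static void main(String[] args) {
        // round 1: reduce nodes 1 and 2 (3 + 4 = 7)
        int nodes12 = reduceSum(Arrays.asList(
                reduceSum(Arrays.asList(1, 1, 1)),    // node 1
                reduceSum(Arrays.asList(1, 1, 1, 1))  // node 2
        ));
        // round 2: merge with node 3's single value (7 + 1 = 8)
        int result = reduceSum(Arrays.asList(nodes12, reduceSum(Arrays.asList(1))));
        System.out.println(result); // prints 8
    }
}
```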
Word Count: Reduce Function
// key: a word
// values: a list of collided values for this word
Map<String, Integer> reduce( String key, List<Integer> values ) {
// sum the counts for this word
int total = 0;
// iterate over all integers in the collided value list
for ( int count : values ) {
total += count;
}
output.put( key, total );
}
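Between map() and reduce(), the framework groups all collided values under their key before handing each group to the reducer. A minimal plain-Java simulation of that grouping (a sketch only; the real framework shuffles across nodes, and these class names are invented for illustration):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ShuffleSketch {
    // map phase: emit (word, 1) for every word, keeping the collisions
    public static Map<String, List<Integer>> mapPhase(String document) {
        Map<String, List<Integer>> collided = new HashMap<>();
        for (String word : document.split("\\s+")) {
            collided.computeIfAbsent(word, k -> new ArrayList<>()).add(1);
        }
        return collided;
    }

    // reduce phase: sum the collided value list for each word
    public static Map<String, Integer> reducePhase(Map<String, List<Integer>> collided) {
        Map<String, Integer> result = new HashMap<>();
        for (Map.Entry<String, List<Integer>> e : collided.entrySet()) {
            int total = 0;
            for (int count : e.getValue()) {
                total += count;
            }
            result.put(e.getKey(), total);
        }
        return result;
    }

    public static void main(String[] args) {
        // "the" collides three times, so reduce() receives {1, 1, 1} for it
        System.out.println(reducePhase(mapPhase("the cat is the cat the")));
    }
}
```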
Hadoop: Java MapReduce Implementation
Hadoop is an open source project run by the Apache Foundation
Provides an API to write MapReduce programs easily and efficiently in Java
Installed on the Thug (Paranoia) cluster at RIT
Used in association with a distributed filesystem modeled on the Google File System
Hadoop has several frameworks for different parts of the MapReduce paradigm.
Hadoop API
A Summary of:
• Input Formats
• Record Readers
• Mapping Function
• Reducing Function
• Output Formats
• General Program Layout
Hadoop Input Formats
Most Hadoop programs read their input from a file.
The data from a Hadoop input file must be parsed into (key, value) pairs before it ever reaches the map() function.
The key and value types from the input file are separate from the key and value types used for map and reduce.
A very simple input format: TextInputFormat
• Key: Line number
• Value: A line of text from the input file
Hadoop Data Types
Hadoop uses special data types for serialization between nodes and for data comparison:
• IntWritable – integer type
• LongWritable – long integer type
• Text – string type
• many, many more
The WordCount program used the TextInputFormat, as we saw in the map() function:
public void map( LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter ) throws IOException {
Word Count Example Program
public void map( LongWritable key, Text value,
OutputCollector<Text, IntWritable> output,
Reporter reporter ) throws IOException {
key = line number
value = text from the document at that line number
output = OutputCollector, a special Map that allows collisions
• output is written to the output collector
reporter = special Hadoop object for reporting progress
• doesn't need to be used – for more advanced programs
Hadoop Input Format
The InputFormat object of the program reads data on a single node from a file, or other source.
The InputFormat is then asked to partition the data into smaller sets of data for each node to process.
Let's look at a new sample program that does not read any input from a file:
• Sum all prime numbers between 1 and 1000.
Prime Number Example
We could make a file that lists all numbers from 1 to 1000, but this is unnecessary: we can write our own InputFormat class, PrimeInputFormat, to generate these numbers and divide them amongst the nodes.
The InputFormat should generate numbers 1 to 1000, then split the numbers into n groups.
InputFormats generate or read (key, value) pairs, so what should we use?
• We have no natural key, only values (numbers), so we use a dummy key.
InputFormat Interface
// interface to be used with objects of type K and V
public interface InputFormat<K, V> {
// validates the input for the specified job (can be ignored)
public void validateInput( JobConf job ) throws IOException;

// returns an array of "InputSplit" objects that are sent to each node;
// the number of splits to be made is represented by numSplits
public InputSplit[] getSplits( JobConf job, int numSplits )
throws IOException;

// returns an "iterator" of sorts for a node to extract (key, value) pairs
// from an InputSplit
public RecordReader<K, V> getRecordReader( InputSplit split,
JobConf job, Reporter reporter ) throws IOException;
}
InputSplit Interface
public interface InputSplit extends Writable {
// get the total bytes of data in this input split
public long getLength();

// get the hostnames of where the splits are (can be ignored)
public String[] getLocations();
}
// represents an object that can be serialized/deserialized
public interface Writable {
// read in the fields of this object from the DataInput object
public void readFields( DataInput in );

// write the fields of this object to the DataOutput object
public void write( DataOutput out );
}
Prime Number InputSplit: RangeSplit
An InputSplit for our program need only hold the range of numbers for each node (min, max).
public class RangeSplit implements InputSplit {
int min, max;
public RangeSplit() { super(); }
public long getLength() { return ( long )( max - min ); }
public String[] getLocations() { return new String[]{}; }
public void write( DataOutput out ) {
out.writeInt( min ); out.writeInt( max );
}
public void readFields( DataInput in ) {
min = in.readInt(); max = in.readInt();
}
}
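The write/readFields pair above is what lets Hadoop ship a split to another node as raw bytes. The round trip can be exercised with plain java.io streams; this sketch mirrors RangeSplit's field layout outside Hadoop (RangeSplitDemo and roundTrip are names invented for the example):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class RangeSplitDemo {
    int min, max;

    // same field layout as the slides' RangeSplit.write()
    public void write(DataOutputStream out) throws IOException {
        out.writeInt(min);
        out.writeInt(max);
    }

    // same field layout as the slides' RangeSplit.readFields()
    public void readFields(DataInputStream in) throws IOException {
        min = in.readInt();
        max = in.readInt();
    }

    // serialize one split to bytes, then deserialize into a fresh object
    public static int[] roundTrip(int min, int max) {
        try {
            RangeSplitDemo original = new RangeSplitDemo();
            original.min = min;
            original.max = max;
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            original.write(new DataOutputStream(buf));

            RangeSplitDemo copy = new RangeSplitDemo();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(buf.toByteArray())));
            return new int[] { copy.min, copy.max };
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        int[] r = roundTrip(251, 500);
        System.out.println(r[0] + ".." + r[1]); // prints 251..500
    }
}
```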
PrimeInputFormat.getSplits();
Create RangeSplit objects for our program:
public InputSplit[] getSplits( JobConf job, int numSplits ) throws IOException {
RangeSplit[] splits = new RangeSplit[ numSplits ];
// for simplicity's sake, we assume 1000 is evenly divisible
// by numSplits, but this may not always be the case
int len = 1000 / numSplits;
for ( int i = 0; i < numSplits; i++ ) {
splits[ i ] = new RangeSplit();
splits[ i ].min = ( i * len ) + 1;
splits[ i ].max = ( i + 1 ) * len;
} // for
return splits;
} // getSplits
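The splitting arithmetic can be checked on its own, outside Hadoop. This standalone sketch (names invented for the example) computes the same (min, max) ranges and shows they tile 1..1000 with no gaps or overlaps:

```java
public class SplitRanges {
    // compute the (min, max) range for split i of numSplits over 1..total,
    // assuming total is evenly divisible by numSplits (as the slide does)
    public static int[] range(int i, int numSplits, int total) {
        int len = total / numSplits;
        return new int[] { i * len + 1, (i + 1) * len };
    }

    public static void main(String[] args) {
        // prints (1, 250) (251, 500) (501, 750) (751, 1000)
        for (int i = 0; i < 4; i++) {
            int[] r = range(i, 4, 1000);
            System.out.print("(" + r[0] + ", " + r[1] + ") ");
        }
        System.out.println();
    }
}
```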
Record Reader
An InputSplit for our program holds the range of numbers (min, max) for one node.
• i.e. for 4 nodes:
• (1, 250), (251, 500), (501, 750), (751, 1000)
RecordReader is then responsible for generating (key, value) pairs from an InputSplit.
Our RecordReader will iterate from min to max on each node.
One RecordReader is used per Mapper.
RecordReader Interface
// the record reader is responsible for iterating over (key, value) pairs in
// an input split
public interface RecordReader<K, V> {
// creates a key/value for the record reader (for generics)
public K createKey();
public V createValue();
// get the position of the iterator
public int getPos();
// get the progress of the iterator (0.0 to 1.0)
public float getProgress();
// populate key and value with the next tuple from the InputSplit
public void next( K key, V value );
// marks the end of use of the RecordReader
public void close();
} // RecordReader
PrimeInputFormat.getRecordReader();
// the record reader is responsible for iterating over (key, value) pairs in
// an input split
public RecordReader<Text, IntWritable> getRecordReader( InputSplit split,
JobConf conf, Reporter reporter ) throws IOException {
final RangeSplit range = ( RangeSplit )split;
// return a new anonymous inner class
return new RecordReader<Text, IntWritable>() {
int pos = 0;
public Text createKey() { return new Text(); }
public IntWritable createValue() { return new IntWritable(); }
public int getPos() { return pos; }
public float getProgress() {
return ( float )pos / ( float )( range.max - range.min );
}
// continued ...
PrimeInputFormat.getRecordReader();
// return the next key, value pair and increment the position
public void next( Text key, IntWritable value ) throws IOException {
// get the number at this position
int val = range.min + pos;
// dummy key value
key.set( "key" );
// set the number for the value
value.set( val );
// increment the position
pos++;
}
// close the RecordReader
public void close() { };
};
}
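Stripped of the Hadoop interfaces, the RecordReader above is just an iterator from min to max. A plain-Java sketch of that iteration (names invented for the example; unlike the slide's getProgress, this one divides by the count of values, max - min + 1, so progress reaches exactly 1.0):

```java
public class RangeReader {
    final int min, max;
    int pos = 0;

    public RangeReader(int min, int max) {
        this.min = min;
        this.max = max;
    }

    // true while there are still values left in the range
    public boolean hasNext() {
        return min + pos <= max;
    }

    // produce the next value and advance the position,
    // like next() populating the IntWritable
    public int next() {
        return min + pos++;
    }

    public float getProgress() {
        return (float) pos / (float) (max - min + 1);
    }

    // drain a whole split the way one Mapper's loop would
    public static int sumAll(int min, int max) {
        RangeReader r = new RangeReader(min, max);
        int sum = 0;
        while (r.hasNext()) {
            sum += r.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(sumAll(1, 250)); // prints 31375
    }
}
```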
Prime Number Mapper
Our program's Mapper now reads in a dummy key and a number. What should our new Map data types be?
• BooleanWritable = prime/not prime
• IntWritable = the number
Reducer can then add together all values with a “true” boolean key, and ignore all “false” values.
public static class PrimeMapper extends MapReduceBase
implements Mapper<Text, IntWritable, BooleanWritable, IntWritable>
// ^mapper input^ ^mapper output^
Prime Sum Mapper
public static class PrimeMapper extends MapReduceBase
implements Mapper<Text, IntWritable, BooleanWritable, IntWritable> {
public void map( Text key, IntWritable value,
OutputCollector<BooleanWritable, IntWritable> output,
Reporter reporter ) throws IOException {
// check if the number is prime – choose your favorite prime number
// testing algorithm
if ( isPrime( value.get() ) ) {
output.collect( new BooleanWritable( true ), value );
} else {
output.collect( new BooleanWritable( false ), value );
}
}
} // PrimeMapper
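The slide leaves isPrime() to the reader. One possible choice is trial division up to the square root, which is plenty fast for n up to 1000 (the class name here is invented for the example):

```java
public class PrimeTest {
    // trial division: n is prime iff no d in [2, sqrt(n)] divides it
    public static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isPrime(997)); // prints true
    }
}
```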
Prime Number Reducer
Reducer will take multiple (boolean, int) pairs, and reduce to a single (text, int) pair:
• “Sum”, int
public static class PrimeReducer extends MapReduceBase
implements Reducer<BooleanWritable, IntWritable, Text, IntWritable>
// ^reducer input^ ^reducer output^
Prime Sum Reducer
public static class PrimeReducer extends MapReduceBase
implements Reducer<BooleanWritable, IntWritable, Text, IntWritable> {
public void reduce( BooleanWritable key, Iterator<IntWritable> values,
OutputCollector<Text, IntWritable> output,
Reporter reporter ) throws IOException {
// ignore the "false" values
if ( key.get() ) {
// sum the values and write to the output collector
int sum = 0;
while ( values.hasNext() ) {
sum += values.next().get();
}
output.collect( new Text( "Sum" ), new IntWritable( sum ) );
}
}
} // PrimeReducer
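The whole job can be simulated on one node in plain Java: "map" each number to (isPrime, n), then "reduce" by summing only the values keyed true. This sketch is an illustration of the data flow, not the Hadoop program itself (class and method names invented for the example):

```java
public class PrimeSumSketch {
    // same trial-division test the mapper would delegate to
    public static boolean isPrime(int n) {
        if (n < 2) return false;
        for (int d = 2; (long) d * d <= n; d++) {
            if (n % d == 0) return false;
        }
        return true;
    }

    // one-node simulation: the map phase tags each number prime/not prime,
    // and the reduce phase sums only the true-keyed values
    public static int primeSum(int min, int max) {
        int sum = 0;
        for (int n = min; n <= max; n++) {
            if (isPrime(n)) {       // the reducer ignores false keys
                sum += n;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(primeSum(1, 1000));
    }
}
```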
Output Formats
Hadoop uses an OutputFormat just like an InputFormat.
Easiest to use:
• TextOutputFormat
• Writes each key and value, tab-separated, one pair per line to an output file
• The file is written to the distributed filesystem (HDFS).
Program Layout
Our example Prime Sum program:
public static void main( String[] args ) {
JobConf job = new JobConf( PrimeSum.class );
job.setJobName( "primesum" );
job.setOutputPath( new Path( "primesum-output" ) );
job.setOutputKeyClass( Text.class );
job.setOutputValueClass( IntWritable.class );
job.setMapperClass( PrimeMapper.class );
job.setReducerClass( PrimeReducer.class );
job.setInputFormat( PrimeInputFormat.class );
job.setOutputFormat( TextOutputFormat.class );
Program Layout
Last but not least:
// run the job
JobClient.runJob( job );
} // main