Introducing MapReduce Programming Framework
TRANSCRIPT
Introducing MapReduce Programming Model
Samuel Yee
MapReduce Programming Model
For parallelization and distributed computing, programmers don’t have to worry about multi-threading, system failures, file I/O, networking, data loss, and so on. All of these complex low-level activities are taken care of by Hadoop.
Focus on 2 key functions instead: Mapper and Reducer
Mapper function
- Ingests from large input files, split up into many smaller blocks (default block size: 64 MB)
- Transforms inputs into key-value pairs, which are shuffled and mapped to the Reduce function
Reducer function
- Reduces outputs by aggregating, summing, eliminating, etc.
- Writes to output files
Key-value pair types must match between the Mapper and Reducer functions
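The Mapper/Reducer contract above can be sketched on a single machine with plain Java collections. This is an illustrative simplification, not Hadoop's actual API (a real job would extend Hadoop's `Mapper` and `Reducer` classes); the class and method names here are hypothetical.

```java
import java.util.*;

// Single-machine sketch of the map -> shuffle -> reduce flow (hypothetical
// names; not Hadoop's real API).
public class MapReduceSketch {
    // Mapper: transform one line of input into key-value pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            pairs.add(Map.entry(word, 1)); // emit [word, 1]
        }
        return pairs;
    }

    // Reducer: aggregate all values that share a key (here: sum the counts).
    static int reduce(String key, List<Integer> values) {
        int sum = 0;
        for (int v : values) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        // "Shuffle" step: group mapper outputs by key; TreeMap keeps keys sorted.
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : map("Hello World Bye World")) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            System.out.println(e.getKey() + " " + reduce(e.getKey(), e.getValue()));
        }
    }
}
```

Note how the matching key-value types (`String` keys, `Integer` values) are what connect `map` and `reduce`: the shuffle step can only group the mapper's output if the reducer accepts the same types.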
Data Processing (MapReduce)
[Diagram: Input Data is split into [k1, v1] pairs and fed to several Map() tasks; the map outputs are sorted by k1 and merged into [k1, [v1, v2, v3…]] lists, which several Reduce() tasks turn into Output Data]
Hadoop’s Approach
[Diagram: Big Data is split into smaller data blocks]
Hadoop’s Approach
[Diagram: a computing process is mapped to each data block; each block produces an output, and the outputs are reduced by aggregating into a single result]
Consider Two Input Files
File01.txt: Hello World Bye World
File02.txt: Hello Hadoop Goodbye Hadoop
Outputs of Mappers
Process 1 [Hello, 1] [Hadoop, 1] [Goodbye, 1] [Hadoop, 1]
Process 2 [Hello, 1] [World, 1] [Bye, 1] [World, 1]
Consolidated Result of Reducers
[Bye, 1] [Goodbye, 1] [Hadoop, 2] [Hello, 2] [World, 2]
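The consolidated result above can be reproduced on a single machine with a few lines of Java. This is a sketch of what the Hadoop word-count job computes, not Hadoop code itself; the class and method names are hypothetical.

```java
import java.util.*;

// Reproduces the two-file word-count example: emit [word, 1] for every word,
// then sum the counts per word. TreeMap keeps keys sorted, as reducers see them.
public class WordCountExample {
    static Map<String, Integer> wordCount(List<String> files) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String contents : files) {
            for (String word : contents.split("\\s+")) {
                counts.merge(word, 1, Integer::sum); // combine the [word, 1] pairs
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Contents of File01.txt and File02.txt from the example
        Map<String, Integer> result = wordCount(
            List.of("Hello World Bye World", "Hello Hadoop Goodbye Hadoop"));
        System.out.println(result);
        // {Bye=1, Goodbye=1, Hadoop=2, Hello=2, World=2}
    }
}
```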
MapReduce Template in Java
Demo
MapReduce programming using IntelliJ IDEA and Java. Read my LinkedIn article on how to set up a development environment for MapReduce and Spark on Windows: http://tinyurl.com/px9rwwk