TRANSCRIPT
Advanced topics on Mapreduce with Hadoop
Jiaheng Lu
Department of Computer Science
Renmin University of China
www.jiahenglu.net
Outline
Brief Review
Chaining MapReduce Jobs
Join in MapReduce
Bloom Filter
Brief Review
A parallel programming framework
Divide and merge
[Figure: MapReduce data flow — the input data is divided into splits (split0, split1, split2); each split is processed by a map task (the mappers); the shuffle phase redistributes the intermediate pairs to the reduce tasks (the reducers); the reducers write the output data (output0, output1)]
Chaining MapReduce jobs
Chaining in a sequence
Chaining with complex dependency
Chaining preprocessing and postprocessing steps
Chaining in a sequence
Simple and straightforward: [MAP | REDUCE]+
Each job in the chain may itself follow the pattern MAP+ | REDUCE | MAP*
The output of one job is the input to the next
Similar to Unix pipes
// Classic mapred API (JobConf-based)
Configuration conf = getConf();
JobConf job = new JobConf(conf);
job.setJobName("ChainJob");
job.setInputFormat(TextInputFormat.class);
job.setOutputFormat(TextOutputFormat.class);
FileInputFormat.setInputPaths(job, in);
FileOutputFormat.setOutputPath(job, out);

// Add the first mapper of the chain; the signature is
// addMapper(job, mapperClass, inputKey, inputValue, outputKey, outputValue,
//           byValue, mapperConf)
JobConf map1Conf = new JobConf(false);
ChainMapper.addMapper(job, Map1.class,
    LongWritable.class, Text.class,   // input key/value types
    Text.class, Text.class,           // output key/value types
    true, map1Conf);                  // byValue = true: pass a copy downstream
Chaining with complex dependency
Jobs are not chained in a linear fashion
Use the addDependingJob() method to add dependency information:
x.addDependingJob(y) — job x will not start until job y has completed
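As a sketch, the dependency wiring above can be driven with JobControl from the classic mapred API; the JobConf objects for the two jobs are assumed to be configured elsewhere, and the class and variable names here are illustrative:

```java
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.jobcontrol.Job;
import org.apache.hadoop.mapred.jobcontrol.JobControl;

public class DependentJobs {
    // confX and confY are placeholders for fully configured JobConf objects
    public static void runChain(JobConf confX, JobConf confY) throws Exception {
        Job y = new Job(confY);   // upstream job
        Job x = new Job(confX);   // downstream job
        x.addDependingJob(y);     // x will not start until y completes

        JobControl control = new JobControl("dependency-chain");
        control.addJob(x);
        control.addJob(y);

        Thread runner = new Thread(control); // JobControl implements Runnable
        runner.start();
        while (!control.allFinished()) {
            Thread.sleep(1000);              // poll until the whole DAG is done
        }
        control.stop();
    }
}
```

JobControl keeps track of which jobs' dependencies are satisfied and submits them as they become runnable, so arbitrary DAGs of jobs can be expressed with repeated addDependingJob() calls.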
Chaining preprocessing and postprocessing steps
Example: removing stop words in information retrieval
Approaches:
Run separate jobs: inefficient
Chain the steps into a single job
Use ChainMapper.addMapper() and ChainReducer.setReducer()
Pattern: MAP+ | REDUCE | MAP*
Join in MapReduce
Reduce-side join
Broadcast join
Map-side filtering and reduce-side join, filtering by:
a given key
a range from a (broadcast) dataset
a Bloom filter
Reduce-side join
Map output <key, value>: the key is the join key; the value is tagged with its data source
Reduce: do a full cross-product of the values and output the combined results
Example
table x          table y
a  b             a  c
1  ab            1  b
1  cd            2  d
4  ef            4  c

map() output, keyed by the join key and tagged with the source table:
join key | value (tag, data)
1        | (x, ab)
1        | (x, cd)
4        | (x, ef)
1        | (y, b)
2        | (y, d)
4        | (y, c)

after shuffle(), grouped by join key:
key | value list
1   | (x, ab), (x, cd), (y, b)
2   | (y, d)
4   | (x, ef), (y, c)

reduce() output, the cross-product of x-values and y-values per key:
a  b   c
1  ab  b
1  cd  b
4  ef  c

(Key 2 appears only in table y, so it produces no joined row.)
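Setting the Hadoop plumbing aside, the tag-shuffle-cross-product logic of a reduce-side join can be simulated in plain Java; the class and method names below are illustrative, and the data is the same toy pair of tables as in the example:

```java
import java.util.*;

public class ReduceSideJoinSketch {
    // A value tagged with its source table, as the mappers would emit it.
    record Tagged(String tag, String value) {}

    // Simulate map() + shuffle(): tag each record with its source table and
    // group everything by the join key.
    static Map<Integer, List<Tagged>> shuffle(Map<Integer, List<String>> x,
                                              Map<Integer, List<String>> y) {
        Map<Integer, List<Tagged>> grouped = new TreeMap<>();
        x.forEach((k, vs) -> vs.forEach(v ->
            grouped.computeIfAbsent(k, kk -> new ArrayList<>()).add(new Tagged("x", v))));
        y.forEach((k, vs) -> vs.forEach(v ->
            grouped.computeIfAbsent(k, kk -> new ArrayList<>()).add(new Tagged("y", v))));
        return grouped;
    }

    // Simulate reduce(): full cross-product of x-values and y-values per key.
    static List<String> reduce(Map<Integer, List<Tagged>> grouped) {
        List<String> out = new ArrayList<>();
        grouped.forEach((k, vs) -> {
            for (Tagged a : vs) if (a.tag().equals("x"))
                for (Tagged b : vs) if (b.tag().equals("y"))
                    out.add(k + " " + a.value() + " " + b.value());
        });
        return out;
    }

    public static void main(String[] args) {
        Map<Integer, List<String>> x = Map.of(1, List.of("ab", "cd"), 4, List.of("ef"));
        Map<Integer, List<String>> y = Map.of(1, List.of("b"), 2, List.of("d"), 4, List.of("c"));
        // Key 2 appears only in y, so it produces no joined row.
        reduce(shuffle(x, y)).forEach(System.out::println);
    }
}
```

In a real job the grouping is done by Hadoop's shuffle, and the reducer sees one key with its value list at a time; the cross-product inside reduce() is the part the programmer writes.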
Broadcast join (replicated join)
Broadcast the smaller table and do the join in map()
Use the distributed cache:
DistributedCache.addCacheFile()
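Stripped of the DistributedCache plumbing, the work each mapper does in a broadcast join is an in-memory hash lookup; this plain-Java sketch (illustrative names, same toy tables as the reduce-side join example) shows the core idea:

```java
import java.util.*;

public class BroadcastJoinSketch {
    // Map-side join: look up each big-table record in the in-memory small
    // table, which DistributedCache would have shipped to every mapper.
    static List<String> join(List<Map.Entry<Integer, String>> big,
                             Map<Integer, String> small) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Integer, String> rec : big) {
            String match = small.get(rec.getKey());  // O(1) hash lookup
            if (match != null) {
                out.add(rec.getKey() + " " + rec.getValue() + " " + match);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // The small table y is "broadcast" (here simply an in-memory map).
        Map<Integer, String> y = Map.of(1, "b", 2, "d", 4, "c");
        List<Map.Entry<Integer, String>> x = List.of(
                Map.entry(1, "ab"), Map.entry(1, "cd"), Map.entry(4, "ef"));
        join(x, y).forEach(System.out::println);
    }
}
```

Because each record is joined as it is mapped, no shuffle or reduce phase is needed; the trade-off is that the small table must fit in each mapper's memory.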
Map-side filtering and Reduce-side join
Join key: student IDs from the info table
Generate an IDs file from the info table, then do a broadcast join
What if the IDs file cannot fit in memory? Use a Bloom filter
A Bloom Filter
Introduction
Implementation of a Bloom filter
Use in a MapReduce join
Introduction to Bloom Filter
A space-efficient, constant-size data structure for testing set membership, with two operations: add() and contains()
It gives no false negatives and a small probability of false positives
Implementation of bloom filter
Apply a bit array Add elements
generate k indexes set the k bits to 1
Test elements generate k indexes all k bits are 1 >> true, not all are 1 >> false
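The add()/contains() steps above can be sketched in plain Java; the double-hashing scheme used here to derive the k indexes is one common choice, not the only one:

```java
import java.util.BitSet;

public class SimpleBloomFilter {
    // A minimal Bloom filter: an m-bit array, k hash-derived indexes per element.
    private final BitSet bits;
    private final int m;   // number of bits
    private final int k;   // number of hash functions

    public SimpleBloomFilter(int m, int k) {
        this.bits = new BitSet(m);
        this.m = m;
        this.k = k;
    }

    // Derive k indexes from two base hashes (double hashing).
    private int[] indexes(String element) {
        int h1 = element.hashCode();
        int h2 = (h1 >>> 16) | 1;   // force h2 odd so the k probes differ
        int[] idx = new int[k];
        for (int i = 0; i < k; i++) {
            idx[i] = Math.floorMod(h1 + i * h2, m);
        }
        return idx;
    }

    public void add(String element) {
        for (int i : indexes(element)) bits.set(i);  // set the k bits to 1
    }

    public boolean contains(String element) {
        for (int i : indexes(element)) {
            if (!bits.get(i)) return false;  // any 0 bit: definitely absent
        }
        return true;  // all k bits are 1: present, or a false positive
    }
}
```

Note that contains() can only err in one direction: a 0 bit proves the element was never added, while all-1 bits may be the overlap of other elements' indexes, which is exactly the small false-positive probability described above.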
Example
Bit array of size 10 (indexes 0 to 9), with k = 3 indexes per element:

index:            0 1 2 3 4 5 6 7 8 9
① initial state:  0 0 0 0 0 0 0 0 0 0
② add x (0,2,6):  1 0 1 0 0 0 1 0 0 0
③ add y (0,3,9):  1 0 1 1 0 0 1 0 0 1
④ contains m (1,3,9): bit 1 is 0 >> false (×)
⑤ contains n (0,2,9): bits 0, 2, 9 are all 1 >> true (√), but n was never added: a false positive
Use in MapReduce join
Run a separate subjob to create a Bloom filter over the join keys
Broadcast the Bloom filter and use it in map() of the join job to drop the useless records
Do the join in reduce()
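The filtering step can be sketched in plain Java; the names, the hashing scheme, and the deliberately tiny filter (64 bits, k = 3, so the arithmetic is easy to follow) are all illustrative. With a filter this small, key 2 happens to survive as a false positive, which shows exactly why the reduce-side join must still verify each match:

```java
import java.util.*;

public class BloomJoinFilterSketch {
    static final int M = 64, K = 3;  // toy sizing; real filters are far larger

    // k probe positions for an integer key (double hashing, illustrative).
    static int[] probes(int key) {
        int h1 = Integer.hashCode(key);
        int h2 = (h1 >>> 16) | 1;
        int[] idx = new int[K];
        for (int i = 0; i < K; i++) idx[i] = Math.floorMod(h1 + i * h2, M);
        return idx;
    }

    // Subjob output: a Bloom filter over the small side's join keys.
    static BitSet build(Set<Integer> keys) {
        BitSet bits = new BitSet(M);
        for (int k : keys) for (int i : probes(k)) bits.set(i);
        return bits;
    }

    // map() of the join job: forward only the records whose key might be
    // on the small side; everything else is dropped before the shuffle.
    static List<int[]> filter(List<int[]> big, BitSet bits) {
        List<int[]> kept = new ArrayList<>();
        outer:
        for (int[] rec : big) {
            for (int i : probes(rec[0])) if (!bits.get(i)) continue outer;
            kept.add(rec);  // may include false positives; reduce() discards them
        }
        return kept;
    }

    public static void main(String[] args) {
        BitSet bits = build(Set.of(1, 4));
        List<int[]> big = List.of(new int[]{1, 10}, new int[]{2, 20},
                                  new int[]{4, 40}, new int[]{20, 200});
        // Key 20 is dropped; key 2 slips through as a false positive here.
        for (int[] r : filter(big, bits)) System.out.println(r[0] + " " + r[1]);
    }
}
```

The payoff is in shuffle volume: records that cannot possibly join are discarded in map(), and the filter itself stays small enough to broadcast even when the raw list of IDs would not fit in memory.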
References
Chuck Lam, "Hadoop in Action"
Jairam Chandar, "Join Algorithms using Map/Reduce"
THANK YOU