Understanding Hadoop
through examples
Yoshitomo Matsubara
University of California, Irvine
https://github.com/yoshitomo-matsubara/hadoop-example
Objective
To help you intuitively understand how Hadoop works
without
• long descriptions
• Hadoop clusters
with
• simple examples
• small code samples
What is Hadoop?
Apache’s Open-Source Software Framework
• Widely used for big data analytics (e.g., advertising and recommendation systems)
• MapReduce framework
Hope this helps you understand!
Hadoop (MapReduce)
Input Files (Big Data) -> Maps -> Partitioner -> Reduces -> Output Files (Big Data)
• Maps: Map 1, Map 2, Map 3, ..., Map N
  Input: each line
  Output: key, value
• Partitioner: sends each Map output to one of the M Reduces (1, 2, 3, ..., M) <- Not important for introduction
• Reduces: Reduce 1, Reduce 2, Reduce 3, ..., Reduce M
  Input: key, value list
  Output: key, value
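The step between Maps and Reduces can be imitated in plain Java without a Hadoop dependency. The following is a minimal sketch, assuming hash-based partitioning with the same formula as Hadoop's default HashPartitioner; ShuffleSketch, pair, partition and shuffle are hypothetical names, not Hadoop classes. It shows how (key, value) outputs become the "key, value list" input of each Reduce.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical plain-Java sketch of the step between Maps and Reduces; not Hadoop's classes.
class ShuffleSketch {
    // Helper that builds one (key, value) Map output.
    static Map.Entry<String, Integer> pair(String key, int value) {
        return new AbstractMap.SimpleEntry<>(key, value);
    }

    // Same formula as Hadoop's default HashPartitioner: non-negative hash modulo M.
    static int partition(String key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    // Groups Map outputs into one sorted "key -> value list" table per Reduce.
    static List<Map<String, List<Integer>>> shuffle(
            List<Map.Entry<String, Integer>> mapOutputs, int numReduces) {
        List<Map<String, List<Integer>>> perReduce = new ArrayList<>();
        for (int i = 0; i < numReduces; i++) {
            perReduce.add(new TreeMap<>());
        }
        for (Map.Entry<String, Integer> output : mapOutputs) {
            int target = partition(output.getKey(), numReduces);
            perReduce.get(target)
                     .computeIfAbsent(output.getKey(), k -> new ArrayList<>())
                     .add(output.getValue());
        }
        return perReduce;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> mapOutputs =
                Arrays.asList(pair("hadoop", 1), pair("map", 1), pair("hadoop", 1));
        List<Map<String, List<Integer>>> perReduce = shuffle(mapOutputs, 2);
        for (int i = 0; i < perReduce.size(); i++) {
            System.out.println("Reduce " + (i + 1) + " receives " + perReduce.get(i));
        }
    }
}
```

Every occurrence of the same key lands in the same Reduce, which is why a single Reduce can compute the final result for that key.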
Two examples
1. Word count
• Famous example
• You can skip this if you already know it
2. Quiz grading
• Original example
• Grade 5 quizzes for 500 students
Preparation
1. Fork, clone or download the following Maven project
2. Import the project into IntelliJ IDEA or Eclipse (checked with IntelliJ IDEA and Java 1.8)
3. Make sure that it has no build errors
4. Go to the next slide
https://github.com/yoshitomo-matsubara/hadoop-example
Word count: Problem setting
• Input files: text files
• Maps: extract words from each input line
• Reduces: calculate total count for each word
• Output files: text files
Word count: Code in public
ymatsubara.hadoop.example.wordcount
• Input files: src/main/resource/input/wordcount/* (prepare English articles, then copy and paste their text into input files)
• Maps: WordCountMapper (each Map = one instance)
• Reduces: WordCountReducer (each Reduce = one instance)
• Output files: src/main/resource/output/wordcount/*
Word count: Input files
1. Prepare English articles
2. Copy and paste their texts into input files
Input format (e.g.):
keyword1 keyword2 keyword3 keyword2 keyword3
keyword2 keyword4 keyword3. keyword5, keyword1
...
Word count: Map programming
Methods
• setup: called once for each Map before the first map call; not important here
• map: called L times in total for input files with L lines in total
  Key: LongWritable longWritable (not important here)
  Value: Text text (text = one line from the input files)
• cleanup: called once for each Map after all map calls; not important here
https://git.io/vXiI5
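What a word count map call does with one line can be sketched in plain Java. This is a minimal stand-in, assuming that words are lowercased and split on anything that is not a letter or digit; the tokenization in the repository's WordCountMapper may differ, and WordCountMapSketch is a hypothetical name.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical plain-Java stand-in for a word count Map; not the repository's WordCountMapper.
class WordCountMapSketch {
    // Mirrors map(key, value): 'line' plays the role of the Text value.
    // The LongWritable key is ignored, as on the slide.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> output = new ArrayList<>();
        // Assumption: lowercase everything and split on non-alphanumeric characters.
        for (String word : line.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!word.isEmpty()) {
                output.add(new AbstractMap.SimpleEntry<>(word, 1)); // emit (word, 1)
            }
        }
        return output;
    }

    public static void main(String[] args) {
        // Prints [keyword2=1, keyword4=1, keyword3=1, keyword5=1, keyword1=1]
        System.out.println(map("keyword2 keyword4 keyword3. keyword5, keyword1"));
    }
}
```

In Hadoop itself, each pair would be passed to context.write(...) as a Text key and an IntWritable value instead of being collected in a list.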
Word count: Reduce programming
Methods
• setup: similar to Map’s setup; not important here
• reduce: called K times in total (K: number of unique words in the input files)
  Key: Text key (key = a unique word)
  Value: Iterable<IntWritable> values (each value = 1)
• cleanup: similar to Map’s cleanup; not important here
https://git.io/vXiIN
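The reduce call for one word can be sketched the same way. This is a plain-Java illustration under the assumption that every value is 1, as stated on the slide; WordCountReduceSketch is a hypothetical name and not the repository's WordCountReducer.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical plain-Java stand-in for a word count Reduce.
class WordCountReduceSketch {
    // Mirrors reduce(key, values): called once per unique word with all of its values.
    static int reduce(String word, Iterable<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> ones = Arrays.asList(1, 1, 1); // "keyword2" appeared three times
        // Output line format from the slides: keyword<TAB>count
        System.out.println("keyword2" + "\t" + reduce("keyword2", ones));
    }
}
```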
Word count: Output files
Number of output files = Number of Reduces (M)
Output format:
keyword1<TAB>count1
keyword2<TAB>count2
...
Word count: Run the code
Run WordCountDriver in your IDE
and
check the output files (src/main/resource/output/wordcount/part-*).
https://git.io/vXiLW
Quiz grading: Problem setting
• Input files: text files
• Maps: link each quiz score with student ID
• Reduces: calculate total score for each student ID
• Output files: text files
Quiz grading: Code in public
ymatsubara.hadoop.example.grading
• Input files: src/main/resource/input/grading/*
• Maps: GradingMapper (each Map = one instance)
• Reduces: GradingReducer (each Reduce = one instance)
• Output files: src/main/resource/output/grading/*
Quiz grading: Input files
• Each file holds the IDs and scores of 500 students for one quiz
• 5 quizzes; each quiz is scored out of 10 (maximum total: 50 points)
Input format:
ID1<TAB>score1
ID2<TAB>score2
...
ID500<TAB>score500
https://git.io/vXiIj
Quiz grading: Map programming
Methods
• setup: called once for each Map before the first map call; not important here
• map: called 500 * 5 times in total for 5 input files (2,500 lines)
  Key: LongWritable longWritable (not important here)
  Value: Text text (text = one line from the input files)
• cleanup: called once for each Map after all map calls; not important here
https://git.io/vXiLU
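For quiz grading, a map call only has to split one ID<TAB>score line into a (student ID, score) pair. The sketch below is plain Java and assumes well-formed lines; GradingMapSketch is a hypothetical name, and the error handling is an illustration rather than what GradingMapper does.

```java
import java.util.AbstractMap;
import java.util.Map;

// Hypothetical plain-Java stand-in for a quiz grading Map; not the repository's GradingMapper.
class GradingMapSketch {
    // Mirrors map(key, value): turns one "ID<TAB>score" line into (student ID, score).
    static Map.Entry<String, Integer> map(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 2) {
            // Assumption: malformed lines are rejected instead of being skipped.
            throw new IllegalArgumentException("Expected ID<TAB>score but got: " + line);
        }
        int score = Integer.parseInt(fields[1].trim());
        return new AbstractMap.SimpleEntry<>(fields[0], score);
    }

    public static void main(String[] args) {
        // Made-up sample line; prints ID1=7
        System.out.println(map("ID1\t7"));
    }
}
```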
Quiz grading: Reduce programming
Methods
• setup: similar to Map’s setup; not important here
• reduce: called 500 times in total (we have 500 unique student IDs)
  Key: Text key (key = a unique student ID)
  Value: Iterable<IntWritable> values (each value = one quiz score)
• cleanup: similar to Map’s cleanup; not important here
https://git.io/vXiLt
Quiz grading: Output files
Output format:
ID1<TAB>total score1
ID2<TAB>total score2
...
ID500<TAB>total score500
https://git.io/vXiLe
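The whole quiz grading job can be imitated in a single Java process to see how input lines turn into the output format above. This is a sketch under stated assumptions: it uses made-up data (2 quizzes and 2 students instead of 5 and 500), it runs without Hadoop, and GradingJobSketch is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical single-process imitation of the quiz grading job; not the repository's code.
class GradingJobSketch {
    // Runs "Map" on every line, groups scores by student ID, then "Reduce" sums each group.
    static Map<String, Integer> run(List<List<String>> quizFiles) {
        Map<String, List<Integer>> scoresById = new TreeMap<>();
        for (List<String> file : quizFiles) {
            for (String line : file) {
                String[] fields = line.split("\t"); // Map: ID<TAB>score -> (ID, score)
                scoresById.computeIfAbsent(fields[0], id -> new ArrayList<>())
                          .add(Integer.parseInt(fields[1].trim()));
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : scoresById.entrySet()) {
            int total = 0; // Reduce: one call per unique student ID
            for (int score : group.getValue()) {
                total += score;
            }
            totals.put(group.getKey(), total);
        }
        return totals;
    }

    public static void main(String[] args) {
        // Made-up sample data: one inner list per quiz file.
        List<List<String>> quizFiles = Arrays.asList(
                Arrays.asList("ID1\t7", "ID2\t10"),
                Arrays.asList("ID1\t9", "ID2\t4"));
        for (Map.Entry<String, Integer> entry : run(quizFiles).entrySet()) {
            System.out.println(entry.getKey() + "\t" + entry.getValue());
        }
    }
}
```

With this sample the totals are 16 for ID1 and 14 for ID2, printed one ID<TAB>total score pair per line.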
Quiz grading: Run the code
Run GradingDriver in your IDE
and
check the output files (src/main/resource/output/grading/part-*).
https://git.io/vXiLc
Summary
• Introduced two examples with code for Hadoop
• You ran the code on your local computer and (hopefully) understood how Hadoop (MapReduce) works
• Now you can set up a similar problem and write code to solve it, based on the examples above
Thank you for viewing!
If you have any questions or suggestions,
please leave a comment. :)
Yoshitomo Matsubara