Understanding Hadoop through examples

Yoshitomo Matsubara

University of California, Irvine

https://github.com/yoshitomo-matsubara/hadoop-example


Objective

To help you intuitively understand how Hadoop works

With

• simple examples

• small code samples

Without

• long descriptions

• Hadoop clusters

What is Hadoop?

Apache’s Open-Source Software Framework

• Widely used for big data analytics (e.g., advertising and recommendation systems)

• MapReduce framework

Hope this helps you understand!

Hadoop (MapReduce)

[Diagram: MapReduce data flow]

• Input Files (Big Data)

• Maps (Map 1, Map 2, Map 3, ..., Map N): input is each line; output is (key, value)

• Partitioner (not important for this introduction)

• Reduces (Reduce 1, Reduce 2, Reduce 3, ..., Reduce M): input is (key, value list); output is (key, value)

• Output Files 1, 2, 3, ..., M (Big Data)
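
The data flow above can be imitated in plain Java without any Hadoop dependency. This is only a minimal sketch of the idea, not how Hadoop is implemented: the class `MapReduceFlowSketch` and its methods are made up for illustration, and everything runs in memory in a single thread. It shows how the (key, value) pairs emitted by the Maps are grouped into one (key, value list) per key before the Reduces see them. The toy task (longest line per first letter) is also an assumption, chosen so that it does not overlap with the two examples that follow.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory imitation of the MapReduce data flow (no Hadoop API used).
final class MapReduceFlowSketch {

    // Map step: one input line in, zero or more (key, value) pairs out.
    // Here the key is the line's first character and the value is the line length.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        if (!line.isEmpty()) {
            pairs.add(new SimpleEntry<>(line.substring(0, 1), line.length()));
        }
        return pairs;
    }

    // Grouping step: collect all values that share a key into one list per key.
    static Map<String, List<Integer>> groupByKey(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    // Reduce step: one (key, value list) in, one value out. Here: the largest value.
    static int reduce(String key, List<Integer> values) {
        int max = Integer.MIN_VALUE;
        for (int value : values) {
            max = Math.max(max, value);
        }
        return max;
    }

    // Runs map on every line, groups the emitted pairs, then reduces each group.
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.addAll(map(line));
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groupByKey(pairs).entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }

    public static void main(String[] args) {
        System.out.println(run(Arrays.asList("apple", "avocado", "banana")));
        // prints {a=7, b=6}
    }
}
```

Note that `reduce` is called once per unique key, not once per pair; the same holds for both examples in this deck.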

Two examples

1. Word count

• Famous example

• You can skip this if you already know it

2. Quiz grading

• Original example

• Grade 5 quizzes for 500 students

Preparation

1. Fork, clone, or download the following Maven project: https://github.com/yoshitomo-matsubara/hadoop-example

2. Import the project into IntelliJ IDEA or Eclipse (checked with IntelliJ IDEA and Java 1.8)

3. Make sure that it builds without errors

4. Go to the next slide



Word count: Problem setting

• Input files: text files

• Maps: extract words from each input line

• Reduces: calculate total count for each word

• Output files: text files

Word count: Code in the public repository

Package: ymatsubara.hadoop.example.wordcount

• Input files: src/main/resource/input/wordcount/*

• Maps: WordCountMapper (each Map is one instance)

• Reduces: WordCountReducer (each Reduce is one instance)

• Output files: src/main/resource/output/wordcount/*

Word count: Input files

1. Prepare English articles

2. Copy and paste their text into the input files

Input format (example):

keyword1 keyword2 keyword3 keyword2 keyword3

keyword2 keyword4 keyword3. keyword5, keyword1

...

Word count: Map programming

Methods

• setup: called once for each Map before the first map call (not important here)

• map

Key: LongWritable longWritable (not important here)

Value: Text text (text = one line from the input files)

-> called L times in total, where L is the total number of lines in the input files

• cleanup: called once for each Map after all map calls (not important here)

https://git.io/vXiI5
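
The work done inside one map call can be sketched in plain Java. This is not the code behind the link above and it does not use the Hadoop API: `WordCountMapSketch` and `extractWords` are hypothetical names, and the tokenization rule (lowercase, words are runs of letters or digits) is an assumption. In a real Mapper, each returned word would be written to the context as the key, with the value 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for the body of a word count map call (no Hadoop API used).
final class WordCountMapSketch {

    // Returns the words that one map call would emit for one input line.
    static List<String> extractWords(String line) {
        List<String> words = new ArrayList<>();
        // Assumption: a word is a run of letters or digits, and case is ignored.
        for (String token : line.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                words.add(token);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(extractWords("keyword2 keyword4 keyword3. keyword5, keyword1"));
        // prints [keyword2, keyword4, keyword3, keyword5, keyword1]
    }
}
```

Stripping punctuation here is what makes `keyword3.` and `keyword3` count as the same word later in the Reduce.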


Word count: Reduce programming

Methods

• setup: similar to Map’s setup (not important here)

• reduce

Key: Text key (key = a unique word)

Value: Iterable<IntWritable> values (each value = 1)

-> called K times in total (K = number of unique words in the input files)

• cleanup: similar to Map’s cleanup (not important here)

https://git.io/vXiIN
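
As a rough illustration of one reduce call, the sketch below sums the values received for a single word. It uses plain `Integer` values instead of Hadoop's `IntWritable`, and the class and method names are hypothetical, so treat it as a simplified stand-in rather than the linked implementation.

```java
import java.util.Arrays;

// Hypothetical stand-in for the body of a word count reduce call (no Hadoop API used).
final class WordCountReduceSketch {

    // Receives one unique word and all the 1s emitted for it; returns the total count.
    static int reduce(String word, Iterable<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    public static void main(String[] args) {
        // keyword2 appears three times in the two example input lines.
        System.out.println("keyword2\t" + reduce("keyword2", Arrays.asList(1, 1, 1)));
        // prints keyword2<TAB>3
    }
}
```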


Word count: Output files

Number of output files = number of Reduces (M)

Output format:

keyword1<TAB>count1

keyword2<TAB>count2

...
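
Why M Reduces give M output files: each key is assigned to exactly one Reduce, and each Reduce writes its own part-* file. The sketch below uses the same arithmetic as Hadoop's default hash partitioner, `(hashCode & Integer.MAX_VALUE) % numReduces`, but applies it to a `java.lang.String` rather than a Hadoop `Text` key, so the concrete partition numbers are illustrative only. `PartitionSketch` and `partitionFor` are hypothetical names.

```java
// Hypothetical illustration of how keys are spread over M Reduces (no Hadoop API used).
final class PartitionSketch {

    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    // Assumption: String.hashCode() stands in for the hash of a Hadoop Text key.
    static int partitionFor(String key, int numReduces) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    public static void main(String[] args) {
        int numReduces = 3; // M, chosen arbitrarily for this sketch
        for (String word : new String[] {"keyword1", "keyword2", "keyword3"}) {
            System.out.println(word + " -> Reduce " + partitionFor(word, numReduces));
        }
    }
}
```

Because the assignment depends only on the key, every occurrence of the same word reaches the same Reduce, which is what lets that Reduce compute the complete count.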

Word count: Run the code

Run WordCountDriver in your IDE, then check the output files (src/main/resource/output/wordcount/part-*).

https://git.io/vXiLW


Quiz grading: Problem setting

• Input files: text files

• Maps: link each quiz score with student ID

• Reduces: calculate total score for each student ID

• Output files: text files

Quiz grading: Code in the public repository

Package: ymatsubara.hadoop.example.grading

• Input files: src/main/resource/input/grading/*

• Maps: GradingMapper (each Map is one instance)

• Reduces: GradingReducer (each Reduce is one instance)

• Output files: src/main/resource/output/grading/*

Quiz grading: Input files

• Each file consists of 500 students’ IDs and scores for the quiz

• 5 quizzes, each quiz score is out of 10 (total max: 50pt)

Input format:

ID1<TAB>score1

ID2<TAB>score2

...

ID500<TAB>score500

https://git.io/vXiIj


Quiz grading: Map programming

Methods

• setup: called once for each Map before the first map call (not important here)

• map

Key: LongWritable longWritable (not important here)

Value: Text text (text = one line from the input files)

-> called 500 * 5 = 2,500 times in total for the 5 input files (2,500 lines)

• cleanup: called once for each Map after all map calls (not important here)

https://git.io/vXiLU
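
For the grading example, one map call only has to split an `ID<TAB>score` line into its two fields. The following plain-Java sketch shows that step under a few assumptions: the names `GradingMapSketch` and `parseLine` and the sample ID `s001` are made up, and rejecting malformed lines with an exception is a design choice of this sketch, not necessarily what the linked Mapper does.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Hypothetical stand-in for the body of a grading map call (no Hadoop API used).
final class GradingMapSketch {

    // Turns one input line "ID<TAB>score" into the pair (student ID, quiz score).
    static Map.Entry<String, Integer> parseLine(String line) {
        String[] fields = line.split("\t");
        if (fields.length != 2) {
            throw new IllegalArgumentException("Expected ID<TAB>score but got: " + line);
        }
        return new SimpleEntry<>(fields[0].trim(), Integer.parseInt(fields[1].trim()));
    }

    public static void main(String[] args) {
        Map.Entry<String, Integer> pair = parseLine("s001\t8");
        System.out.println(pair.getKey() + " -> " + pair.getValue());
        // prints s001 -> 8
    }
}
```

Using the student ID as the key is what brings all five quiz scores of one student together in a single reduce call.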


Quiz grading: Reduce programming

Methods

• setup: similar to Map’s setup (not important here)

• reduce

Key: Text key (key = a unique student ID)

Value: Iterable<IntWritable> values (each value = one quiz score)

-> called 500 times in total (there are 500 unique student IDs)

• cleanup: similar to Map’s cleanup (not important here)

https://git.io/vXiLt
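
One grading reduce call can be sketched as follows: it receives a student ID with that student's quiz scores and produces one `ID<TAB>total score` line. The range check relies on the stated maximum of 10 points per quiz; the class name, method name, and sample values are hypothetical, and plain `Integer` values replace Hadoop's `IntWritable`.

```java
import java.util.Arrays;

// Hypothetical stand-in for the body of a grading reduce call (no Hadoop API used).
final class GradingReduceSketch {

    static final int MAX_SCORE_PER_QUIZ = 10; // each quiz score is out of 10

    // Sums the quiz scores of one student and formats the output line.
    static String reduce(String studentId, Iterable<Integer> scores) {
        int total = 0;
        for (int score : scores) {
            if (score < 0 || score > MAX_SCORE_PER_QUIZ) {
                throw new IllegalArgumentException("Score out of range for " + studentId + ": " + score);
            }
            total += score;
        }
        return studentId + "\t" + total;
    }

    public static void main(String[] args) {
        System.out.println(reduce("s001", Arrays.asList(8, 10, 7, 9, 6)));
        // prints s001<TAB>40
    }
}
```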


Quiz grading: Output files

Output format:

ID1<TAB>total score1

ID2<TAB>total score2

...

ID500<TAB>total score500

https://git.io/vXiLe


Quiz grading: Run the code

Run GradingDriver in your IDE, then check the output files (src/main/resource/output/grading/part-*).

https://git.io/vXiLc


Summary

• Introduced two examples with Hadoop code

• You ran the code on your local computer and (hopefully) understood how Hadoop (MapReduce) works

• Now you can set up a similar problem and write code to solve it, based on the code above

Thank you for viewing!

If you have any questions or suggestions,

please leave a comment. :)
