Understanding Hadoop through examples Yoshitomo Matsubara University of California, Irvine https://github.com/yoshitomo-matsubara/hadoop-example

Upload: yoshitomo-matsubara

Post on 14-Apr-2017


Page 1: Understanding Hadoop through examples

Understanding Hadoop

through examples

Yoshitomo Matsubara

University of California, Irvine

https://github.com/yoshitomo-matsubara/hadoop-example

Page 2: Understanding Hadoop through examples

Objective

To help you intuitively understand how Hadoop works

without

• long descriptions

• Hadoop clusters

with

• simple examples

• small code samples

Page 3: Understanding Hadoop through examples

What is Hadoop?

Apache’s Open-Source Software Framework

• Widely used for big data analytics (e.g., advertising and recommendation systems)

• MapReduce framework

Hope this helps you understand!

Page 4: Understanding Hadoop through examples

Hadoop (MapReduce)

Input Files (Big Data) -> Maps (Map 1, Map 2, Map 3, ..., Map N) -> Partitioner (1, 2, 3, ..., M) -> Reduces (Reduce 1, Reduce 2, Reduce 3, ..., Reduce M) -> Output Files (Big Data)

• Maps: Input: each line / Output: key, value

• Reduces: Input: key, value list / Output: key, value

• Partitioner: not important for introduction

Page 6: Understanding Hadoop through examples

Two examples

1. Word count

• Famous example

• You can skip this if you already know it

2. Quiz grading

• Original example

• Grade 5 quizzes for 500 students

Page 7: Understanding Hadoop through examples

Preparation

1. Fork, clone or download the following Maven project: https://github.com/yoshitomo-matsubara/hadoop-example

2. Import the project into IntelliJ IDEA or Eclipse (checked with IntelliJ IDEA and Java 1.8)

3. Make sure that it has no build errors

4. Go to the next slide

Page 8: Understanding Hadoop through examples

Word count: Problem setting

• Input files: text files

• Maps: extract words from each input line

• Reduces: calculate total count for each word

• Output files: text files
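
The problem setting above can be tried without Hadoop at all. The following is a minimal sketch in plain Java, not the code from the repository: the class name `WordCountSketch`, its methods, and the rule for splitting a line into words are all assumptions made for illustration.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hadoop-free sketch of word count; class and method names are hypothetical.
final class WordCountSketch {

    // "Map": turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        // Assumption: a word is a run of letters or digits; everything else separates words.
        for (String token : line.toLowerCase().split("[^a-z0-9]+")) {
            if (!token.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(token, 1));
            }
        }
        return pairs;
    }

    // "Reduce": sum all values collected for one key.
    static int reduce(String word, Iterable<Integer> values) {
        int total = 0;
        for (int value : values) {
            total += value;
        }
        return total;
    }

    // Driver: map every line, group the values by key, then reduce each group.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : lines) {
            for (Map.Entry<String, Integer> pair : map(line)) {
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            counts.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "keyword1 keyword2 keyword3 keyword2 keyword3",
                "keyword2 keyword4 keyword3. keyword5, keyword1");
        // Print one "word<TAB>count" line per unique word.
        run(lines).forEach((word, count) -> System.out.println(word + "\t" + count));
    }
}
```

In real Hadoop the grouping step in `run` is done by the framework between the Maps and the Reduces; you only write the map and reduce logic.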

Page 9: Understanding Hadoop through examples

Word count: Code in public

ymatsubara.hadoop.example.wordcount

• Input files: src/main/resource/input/wordcount/*

Prepare English articles, then copy and paste their texts into input files

• Maps: WordCountMapper (Each Map = Each instance)

• Reduces: WordCountReducer (Each Reduce = Each instance)

• Output files: src/main/resource/output/wordcount/*

Page 10: Understanding Hadoop through examples

Word count: Input files

1. Prepare English articles

2. Copy and paste their texts into input files

Input format, e.g.:

keyword1 keyword2 keyword3 keyword2 keyword3

keyword2 keyword4 keyword3. keyword5, keyword1

.

.

.

Page 11: Understanding Hadoop through examples

Word count: Map programming

Methods

• setup: called once for each Map before the 1st map call; not important here

• map

Key: LongWritable longWritable (not important here)

Value: Text text (text = one line from input files)

-> called L times in total, where L is the total number of lines in the input files

• cleanup: called once for each Map after all map calls; not important here

https://git.io/vXiI5
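
The call order described on this slide (setup once, map once per line, cleanup once) can be sketched without Hadoop. This is only a simulation: the class `MapLifecycleSketch` and its counters are hypothetical, and using the line's byte offset as the key assumes Hadoop's default text input format.

```java
import java.util.Arrays;
import java.util.List;

// Hadoop-free sketch of the Map lifecycle; all names here are hypothetical.
final class MapLifecycleSketch {
    int setupCalls = 0;
    int mapCalls = 0;
    int cleanupCalls = 0;

    void setup() {
        setupCalls++;
    }

    // offset plays the role of the LongWritable key, line the role of the Text value.
    void map(long offset, String line) {
        mapCalls++;
    }

    void cleanup() {
        cleanupCalls++;
    }

    // What the framework does for one Map: setup once, map per line, cleanup once.
    void run(List<String> lines) {
        setup();
        long offset = 0;
        for (String line : lines) {
            map(offset, line);
            // Assumption: single-byte characters and "\n" line endings.
            offset += line.length() + 1;
        }
        cleanup();
    }

    public static void main(String[] args) {
        MapLifecycleSketch oneMap = new MapLifecycleSketch();
        oneMap.run(Arrays.asList("first line", "second line", "third line"));
        System.out.println("setup calls: " + oneMap.setupCalls);
        System.out.println("map calls: " + oneMap.mapCalls);
        System.out.println("cleanup calls: " + oneMap.cleanupCalls);
    }
}
```

With several Maps, each instance goes through this sequence on its own share of the input, so the map calls add up to L across all instances.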

Page 12: Understanding Hadoop through examples

Word count: Reduce programming

Methods

• setup: similar to Map’s setup; not important here

• reduce

Key: Text key (key = a unique word)

Value: Iterable<IntWritable> values (each value = 1)

-> called K times in total (K: # of unique words in input files)

• cleanup: similar to Map’s cleanup; not important here

https://git.io/vXiIN

Page 13: Understanding Hadoop through examples

Word count: Output files

Number of output files = Number of Reduces (M)

Output format:

keyword1<TAB>count1

keyword2<TAB>count2

.

.

.
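
Why there is one output file per Reduce: every key is assigned to exactly one of the M Reduces, and each Reduce writes its own file. The sketch below imitates hash partitioning (the key's hash code, made non-negative, modulo M), which is how Hadoop's default partitioner behaves. It assumes the example project keeps that default; the class `PartitionSketch` is hypothetical, and Java's `String.hashCode()` stands in for the hash of Hadoop's `Text`, so the actual assignment in a real job will differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hadoop-free sketch of hash partitioning; all names here are hypothetical.
final class PartitionSketch {

    // Pick the Reduce (0 .. numReduces - 1) that receives this key.
    static int partition(String key, int numReduces) {
        // Clearing the sign bit keeps the remainder non-negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduces;
    }

    // Group keys by the Reduce they are sent to; each group ends up in one output file.
    static Map<Integer, List<String>> assign(List<String> keys, int numReduces) {
        Map<Integer, List<String>> byReduce = new TreeMap<>();
        for (int r = 0; r < numReduces; r++) {
            byReduce.put(r, new ArrayList<>());
        }
        for (String key : keys) {
            byReduce.get(partition(key, numReduces)).add(key);
        }
        return byReduce;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("keyword1", "keyword2", "keyword3", "keyword4", "keyword5");
        assign(keys, 3).forEach((reduce, assigned) ->
                System.out.println("Reduce " + reduce + ": " + assigned));
    }
}
```

Because the assignment depends only on the key, all values for one word always reach the same Reduce, which is what makes the per-word total correct.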

Page 14: Understanding Hadoop through examples

Word count: Run the code

Run WordCountDriver in your IDE, then check the output files (src/main/resource/output/wordcount/part-*).

https://git.io/vXiLW

Page 15: Understanding Hadoop through examples

Quiz grading: Problem setting

• Input files: text files

• Maps: link each quiz score with student ID

• Reduces: calculate total score for each student ID

• Output files: text files

Page 16: Understanding Hadoop through examples

Quiz grading: Code in public

ymatsubara.hadoop.example.grading

• Input files: src/main/resource/input/grading/*

• Maps: GradingMapper (Each Map = Each instance)

• Reduces: GradingReducer (Each Reduce = Each instance)

• Output files: src/main/resource/output/grading/*

Page 17: Understanding Hadoop through examples

Quiz grading: Input files

• Each file contains the IDs and scores of 500 students for one quiz

• 5 quizzes; each quiz is scored out of 10 (maximum total: 50 points)

Input format:

ID1<TAB>score1

ID2<TAB>score2

.

.

.

ID500<TAB>score500

https://git.io/vXiIj

Page 18: Understanding Hadoop through examples

Quiz grading: Map programming

Methods

• setup: called once for each Map before the 1st map call; not important here

• map

Key: LongWritable longWritable (not important here)

Value: Text text (text = one line from input files)

-> called 500 * 5 times in total with 5 input files (2,500 lines)

• cleanup: called once for each Map after all map calls; not important here

https://git.io/vXiLU

Page 19: Understanding Hadoop through examples

Quiz grading: Reduce programming

Methods

• setup: similar to Map’s setup; not important here

• reduce

Key: Text key (key = a unique student ID)

Value: Iterable<IntWritable> values (value = each quiz score)

-> called 500 times in total (We have 500 unique student IDs.)

• cleanup: similar to Map’s cleanup; not important here

https://git.io/vXiLt
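
The call counts on the Map and Reduce slides (2,500 map calls, 500 reduce calls) can be checked with a small Hadoop-free simulation. This is a sketch, not the repository's GradingMapper or GradingReducer: the class `GradingSketch`, the ID strings, and the synthetic scores are all made up for illustration, and each line is assumed to have the form ID<TAB>score.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hadoop-free sketch of quiz grading; all names and data here are hypothetical.
final class GradingSketch {
    static int mapCalls = 0;
    static int reduceCalls = 0;

    // "Map": parse one "ID<TAB>score" line into a (student ID, score) pair.
    static Map.Entry<String, Integer> map(String line) {
        mapCalls++;
        String[] fields = line.split("\t");
        return new AbstractMap.SimpleEntry<>(fields[0], Integer.parseInt(fields[1]));
    }

    // "Reduce": add up all quiz scores collected for one student ID.
    static int reduce(String studentId, Iterable<Integer> scores) {
        reduceCalls++;
        int total = 0;
        for (int score : scores) {
            total += score;
        }
        return total;
    }

    // Synthetic input for one quiz: scores between 0 and 10, not real data.
    static List<String> syntheticQuizFile(int quiz, int students) {
        List<String> lines = new ArrayList<>();
        for (int s = 1; s <= students; s++) {
            lines.add("ID" + s + "\t" + ((s + quiz) % 11));
        }
        return lines;
    }

    // Driver: map every line of every file, group scores by ID, reduce each group.
    static Map<String, Integer> run(List<List<String>> files) {
        mapCalls = 0;
        reduceCalls = 0;
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (List<String> file : files) {
            for (String line : file) {
                Map.Entry<String, Integer> pair = map(line);
                grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : grouped.entrySet()) {
            totals.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return totals;
    }

    public static void main(String[] args) {
        List<List<String>> files = new ArrayList<>();
        for (int quiz = 1; quiz <= 5; quiz++) {
            files.add(syntheticQuizFile(quiz, 500));
        }
        Map<String, Integer> totals = run(files);
        System.out.println("map calls: " + mapCalls);
        System.out.println("reduce calls: " + reduceCalls);
        System.out.println("ID1\t" + totals.get("ID1"));
    }
}
```

The counters make the slide's arithmetic visible: one map call per input line and one reduce call per unique student ID, regardless of how many quiz files there are.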

Page 20: Understanding Hadoop through examples

Quiz grading: Output files

Output format:

ID1<TAB>total score1

ID2<TAB>total score2

.

.

.

ID500<TAB>total score500

https://git.io/vXiLe

Page 21: Understanding Hadoop through examples

Quiz grading: Run the code

Run GradingDriver in your IDE, then check the output files (src/main/resource/output/grading/part-*).

https://git.io/vXiLc

Page 22: Understanding Hadoop through examples

Summary

• Introduced two examples with Hadoop code

• You ran the code on your local computer and (hopefully) understood how Hadoop (MapReduce) works

• Now you can set up a similar problem and write code to solve it, based on the examples above

Page 23: Understanding Hadoop through examples

Thank you for viewing!

If you have any questions or suggestions,

please leave a comment. :)

Yoshitomo Matsubara