
CS535 Big Data, 1/30/2019, Week 2-B, Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535, Spring 2019, Colorado State University

CS535 BIG DATA
PART A. BIG DATA TECHNOLOGY
3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING
Sangmi Lee Pallickara
Computer Science, Colorado State University
http://www.cs.colostate.edu/~cs535

FAQs
• Term project deliverable 0
  • Item 1: Your team members
  • Item 2: Tentative project titles (up to 3)
  • Submission deadline: Feb. 1, via email or Canvas
• PA1
  • Hadoop and Spark installation guides are posted
  • If you would like to start your homework, please send me an email with your team information. I will assign the port range for your team.
• Quiz 1: February 4, 2019, in class

1/30/2019 Colorado State University, Spring 2019 Week 2-A-1

Topics of Today's Class
• Overview of Programming Assignment 1
• 3. Distributed Computing Models for Scalable Batch Computing
  • MapReduce

Programming Assignment 1: Hyperlink-Induced Topic Search (HITS)

This material is based on:
• Kleinberg, Jon. "Authoritative sources in a hyperlinked environment". Journal of the ACM 46 (5): 604-632.

Types of Web queries
• Yes/No queries
  • Does Chrome support the .ogv video format?
• Broad-topic queries
  • Find information about "polar vortex"
• Similar-page queries
  • Find pages similar to https://stackoverflow.com

Image credit: https://www.cnn.com/2019/01/30/weather/winter-weather-wednesday-wxc/index.html


Ranking algorithm to find the most "authoritative" pages
• Goal: find the small set of the most authoritative pages that are relevant to the query
• Examples of authoritative pages
  • For the topic "python": https://www.python.org/
  • For information about "Colorado State University": https://www.colostate.edu/
  • For images of "iPhone": https://www.apple.com/iphone/

Challenges of content-based ranking
• The most useful pages often do not include the keyword the users are looking for
  • e.g., "computer" does not appear on Apple's home page (captured Jan. 30, 2019)
  • The same holds for IBM's web page
• Pages are not sufficiently descriptive
  • e.g., "health care" on Poudre Valley Hospital's page?

HITS (Hypertext-Induced Topic Search)
• PageRank captures a simplistic view of a network
• Authority
  • A Web page with good, authoritative content on a specific topic
  • A Web page that is linked to by many hubs
• Hub
  • A Web page pointing to many authoritative Web pages
  • e.g., portal pages (Yahoo)

HITS (Hypertext-Induced Topic Search), a.k.a. Hubs and Authorities
• Jon Kleinberg, 1997
• Topic search: automatically determines hubs and authorities
• In practice
  • Performed only on the result set (PageRank is applied to the complete set of documents)
  • Developed for the IBM CLEVER project
  • Used by Teoma (later Ask.com)


Understanding Authorities and Hubs [1/2]
• Intuitive idea for finding authoritative results using link analysis:
  • Not all hyperlinks are related to the conferral of authority
  • Pattern that authoritative pages have: authoritative pages share considerable overlap in the sets of pages (hubs) that point to them

Understanding Authorities and Hubs [2/2]
• A good hub page points to many good authoritative pages
• A good authoritative page is pointed to by many good hub pages
• Authorities and hubs have a mutually reinforcing relationship

Calculating Authority/Hub scores [1/3]

Let there be n Web pages. Define the n x n adjacency matrix A such that A_uv = 1 if there is a link from page u to page v, and A_uv = 0 otherwise.

For the example graph with pages P1-P4:

  A = 0 1 1 1
      0 0 1 1
      1 0 0 1
      0 0 0 1

Calculating Authority/Hub scores [2/3]

Each Web page i has an authority score a_i and a hub score h_i. We define the authority score by summing up the hub scores of the pages that point to it:

  a_i = Σ_j h_j A_ji    (j: row index in the matrix, i: column index)

This can be written concisely as a = Aᵀh.

Calculating Authority/Hub scores [3/3]

Similarly, we define the hub score by summing up the authority scores a_j of the pages it points to:

  h_i = Σ_j a_j A_ij    (i: row index in the matrix, j: column index)

This can be written concisely as h = Aa.

Hubs and Authorities

Let's start arbitrarily from a0 = 1, h0 = 1, where 1 is the all-ones vector:
  a0 = (1, 1, 1, 1), h0 = (1, 1, 1, 1)
Repeating the updates, the sequences a0, a1, a2, … and h0, h1, h2, … converge (to limits x* and y*).

a1 = Aᵀh0 = ((1x0)+(1x0)+(1x1)+(1x0),
             (1x1)+(1x0)+(1x0)+(1x0),
             (1x1)+(1x1)+(1x0)+(1x0),
             (1x1)+(1x1)+(1x1)+(1x1)) = (1, 1, 2, 4)

Normalize it: (1/(1+1+2+4), 1/(1+1+2+4), 2/(1+1+2+4), 4/(1+1+2+4)) = (1/8, 1/8, 1/4, 1/2)
a1 = (1/8, 1/8, 1/4, 1/2)   (authority values after the first iteration)


Hubs and Authorities (continued)

h1 = A a1 = ((1/8 x 0)+(1/8 x 1)+(1/4 x 1)+(1/2 x 1),
             (1/8 x 0)+(1/8 x 0)+(1/4 x 1)+(1/2 x 1),
             (1/8 x 1)+(1/8 x 0)+(1/4 x 0)+(1/2 x 1),
             (1/8 x 0)+(1/8 x 0)+(1/4 x 0)+(1/2 x 1)) = (7/8, 6/8, 5/8, 4/8)

After the normalization: h1 = (7/22, 6/22, 5/22, 4/22)   (hub values after the first iteration)
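The two update rules and the worked iteration above can be sketched in plain Java. This is a minimal sketch, not assignment code: it uses the sum-to-1 normalization the slides allow, and a fixed 50-iteration cap stands in for a real convergence threshold.

```java
// HITS iteration on the 4-page example graph from the slides.
// A[u][v] = 1 iff page u links to page v.
public class HitsDemo {
    static final double[][] A = {
        {0, 1, 1, 1},
        {0, 0, 1, 1},
        {1, 0, 0, 1},
        {0, 0, 0, 1},
    };

    // a = normalize(A^T h): a page's authority sums the hub scores pointing to it
    static double[] authorityStep(double[] h) {
        double[] a = new double[4];
        for (int v = 0; v < 4; v++)
            for (int u = 0; u < 4; u++)
                a[v] += A[u][v] * h[u];
        return normalize(a);
    }

    // h = normalize(A a): a page's hub score sums the authorities it points to
    static double[] hubStep(double[] a) {
        double[] h = new double[4];
        for (int u = 0; u < 4; u++)
            for (int v = 0; v < 4; v++)
                h[u] += A[u][v] * a[v];
        return normalize(h);
    }

    // sum-to-1 normalization: value = value / (sum of all values)
    static double[] normalize(double[] x) {
        double sum = 0;
        for (double v : x) sum += v;
        for (int i = 0; i < x.length; i++) x[i] /= sum;
        return x;
    }

    public static void main(String[] args) {
        double[] a = {1, 1, 1, 1}, h = {1, 1, 1, 1};
        for (int iter = 0; iter < 50; iter++) {  // or loop until the change falls below a threshold
            a = authorityStep(h);
            h = hubStep(a);
        }
        System.out.printf("a = %s%nh = %s%n",
                java.util.Arrays.toString(a), java.util.Arrays.toString(h));
    }
}
```

The first iteration of this sketch reproduces the slide values exactly: a1 = (1/8, 1/8, 1/4, 1/2) and h1 = (7/22, 6/22, 5/22, 4/22).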

Implementing Topic Search using HITS
• Step 1: Construct a focused subgraph based on a query
• Step 2: Iteratively calculate the authority and hub values of the pages in the subgraph

Step 1. Constructing a focused subgraph (root set)
• Generate a root set from a text-based search engine
  • e.g., pages containing the query words

Step 2. Constructing a focused subgraph (base set)
• For each page p ∈ R (the root set):
  • Add the set of all pages p points to
  • Add the set of all pages pointing to p
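A minimal in-memory sketch of the base-set expansion above. The `baseSet` helper and the toy link map are hypothetical, not the assignment's input format; in practice the link data would come from your input files.

```java
import java.util.*;

// Build the focused subgraph: start from the root set, then add every page
// each root page links to, and every page that links to a root page.
public class BaseSetDemo {
    static Set<String> baseSet(Set<String> root,
                               Map<String, List<String>> outLinks) {
        Set<String> base = new HashSet<>(root);
        for (String p : root) {
            // pages p points to
            base.addAll(outLinks.getOrDefault(p, List.of()));
            // pages pointing to p
            for (Map.Entry<String, List<String>> e : outLinks.entrySet())
                if (e.getValue().contains(p)) base.add(e.getKey());
        }
        return base;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "p1", List.of("p2", "p3"),
            "p4", List.of("p1"),
            "p5", List.of("p6"));
        // root {p1} expands to p1 plus its out-links p2, p3 and its in-link p4
        System.out.println(baseSet(Set.of("p1"), links));
    }
}
```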

Step 3. Initial values

Nodes  Hub  Authority
P1     1    1
P2     1    1
P3     1    1
P4     1    1

Ranks
• Hub: P1 = P2 = P3 = P4
• Authority: P1 = P2 = P3 = P4

Step 4. After the first iteration

Nodes  Hub   Authority
P1     7/22  1/8
P2     6/22  1/8
P3     5/22  2/8
P4     4/22  4/8

Ranks
• Hub: P1 > P2 > P3 > P4
• Authority: P1 = P2 < P3 < P4

Normalization
• Original paper: normalize so the squares sum to 1
• For this assignment, you can normalize so the values sum to 1:
  • value = value / (sum of all values)


Step N. Convergence of scores
• Repeat the calculation (Step 4) until the scores converge
• You should specify your threshold (a maximum number of iterations, N)

Do we need to perform the matrix multiplication?
• Yes and no
  • Matrix multiplication is a valid answer
  • However, you can also consider a random-walk-style implementation
  • See the PageRank example provided by Apache Spark:
    https://spark.apache.org/docs/1.6.1/api/java/org/apache/spark/graphx/lib/PageRank.html

3. Distributed Computing Models for Scalable Batch Computing
Section 1. MapReduce
a. Introduction to MapReduce

This material is developed based on:
• Anand Rajaraman, Jure Leskovec, and Jeffrey Ullman, "Mining of Massive Datasets", Cambridge University Press, 2012, Chapter 2 (download this chapter from the CS435 schedule page)
• Tom White, "Hadoop: The Definitive Guide", 3rd Edition, O'Reilly, 2014
• Donald Miner and Adam Shook, "MapReduce Design Patterns", O'Reilly, 2013

What is MapReduce?



MapReduce [1/2]
• MapReduce is inspired by the concepts of map and reduce in Lisp
• "Modern" MapReduce
  • Developed within Google as a mechanism for processing large amounts of raw data
    • e.g., crawled documents or web request logs
  • Distributes the data across thousands of machines
  • The same computation is performed on each CPU, with a different dataset

MapReduce [2/2]
• MapReduce provides an abstraction that allows engineers to perform simple computations while hiding the details of parallelization, data distribution, load balancing, and fault tolerance

Mapper
• A Mapper maps input key/value pairs to a set of intermediate key/value pairs
• Maps are the individual tasks that transform input records into intermediate records
  • The transformed intermediate records do not need to be of the same type as the input records
  • A given input pair may map to zero or many output pairs
• The Hadoop MapReduce framework spawns one map task for each InputSplit generated by the InputFormat for the job

Reducer
• A Reducer reduces the set of intermediate values that share a key to a smaller set of values
• A Reducer has 3 primary phases: shuffle, sort, and reduce
  • Shuffle
    • Input to the reducer is the sorted output of the mappers
    • The framework fetches the relevant partition of the output of all the mappers via HTTP
  • Sort
    • The framework groups the reducer's input by key

MapReduce Example 1


Example 1: NCDC data example
• A National Climatic Data Center record
• Find the maximum temperature for each year (1900-1999)

A sample record, annotated field by field:

0057
332130    # USAF weather station identifier
99999     # WBAN weather station identifier
19500101  # observation date
0300      # observation time
4
+51317    # latitude (degrees x 1000)
+028783   # longitude (degrees x 1000)
FM-12
+0171     # elevation (meters)
99999
V020
320       # wind direction (degrees)
1         # quality code
N


The first entries for 1990:

% ls raw/1990 | head
010010-99999-1990.gz
010014-99999-1990.gz
010015-99999-1990.gz
010016-99999-1990.gz
010017-99999-1990.gz
010030-99999-1990.gz
010040-99999-1990.gz
010080-99999-1990.gz
010100-99999-1990.gz
010150-99999-1990.gz

Analyzing the data with Unix Tools (1/2)
• A program for finding the maximum recorded temperature by year from NCDC weather records

#!/usr/bin/env bash
for year in all/*
do
  echo -ne `basename $year .gz`"\t"
  gunzip -c $year | \
    awk '{ temp = substr($0, 88, 5) + 0;
           q = substr($0, 93, 1);
           if (temp != 9999 && q ~ /[01459]/ && temp > max) max = temp }
         END { print max }'
done

Analyzing the data with Unix Tools (2/2)
• The script loops through the compressed year files
  • Printing the year
  • Processing each file using awk
    • Extracts two fields: the air temperature and the quality code
    • Checks whether the temperature is valid and greater than the maximum value seen so far

% ./max_temperature.sh
1901  317
1902  244
1903  289
1904  256
1905  283
…

Results?
• The complete run for the century took 42 minutes
• To speed up the processing
  • We need to run parts of the program in parallel
  • Process different years in different processes
  • What will be the problems?

Challenges
• Dividing the work into equal-size pieces
  • Data size per year?
• Combining the results from independent processes
  • Combining results and sorting by year?
• You are still limited by the processing capacity of a single machine (the slowest one)!

Map and Reduce
• MapReduce works by breaking the processing into two phases
  • The map phase
  • The reduce phase
• Each phase has key-value pairs as input and output
• Programmers should specify
  • The types of the input/output key-value pairs
  • The map function
  • The reduce function


Visualizing the way MapReduce works (1/3)

Sample lines of input data:

0067011990999991950051507004...9999999N9+00001+99999999999...
0043011990999991950051512004...9999999N9+00221+99999999999...
0043011990999991950051518004...9999999N9-00111+99999999999...
0043012650999991949032412004...0500001N9+01111+99999999999...
0043012650999991949032418004...0500001N9+00781+99999999999...

These lines are presented to the map function as key-value pairs. The keys are the line offsets within the file (and are ignored here):

(0,   0067011990999991950051507004...9999999N9+00001+99999999999...)
(106, 0043011990999991950051512004...9999999N9+00221+99999999999...)
(212, 0043011990999991950051518004...9999999N9-00111+99999999999...)

Visualizing the way MapReduce works (2/3)

The map function extracts the year and the air temperature and emits them as its output:

(1950, 0) (1950, 22) (1950, -11) (1949, 111) (1949, 78)

These output key-value pairs are sorted and grouped by key (values passed to each reducer are NOT sorted). Our reduce function will see the following input:

(1949, [111, 78])
(1950, [0, 22, -11])

Visualizing the way MapReduce works (3/3)

The reduce function iterates through each list and picks the maximum reading. The final output:

(1949, 111)
(1950, 22)

Overall flow: input → map → shuffle → reduce → output.
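The end-to-end flow above (map → shuffle → reduce) can be simulated in plain Java, without Hadoop. This is a sketch under simplifying assumptions: records are "year temp" strings instead of fixed-offset NCDC lines, and the shuffle is just an in-memory grouping by key.

```java
import java.util.*;

// Simulate the (year, temperature) MapReduce flow: map emits (year, temp),
// the shuffle groups values by key, and reduce keeps the maximum per year.
public class MaxTempFlow {
    public static Map<Integer, Integer> run(List<String> records) {
        // map phase + "shuffle": group mapped values by key, ordered by key
        Map<Integer, List<Integer>> grouped = new TreeMap<>();
        for (String r : records) {
            String[] f = r.split(" ");
            grouped.computeIfAbsent(Integer.parseInt(f[0]), k -> new ArrayList<>())
                   .add(Integer.parseInt(f[1]));
        }
        // reduce phase: each key's value list collapses to its maximum
        Map<Integer, Integer> out = new TreeMap<>();
        for (Map.Entry<Integer, List<Integer>> e : grouped.entrySet())
            out.put(e.getKey(), Collections.max(e.getValue()));
        return out;
    }

    public static void main(String[] args) {
        List<String> recs = List.of("1950 0", "1950 22", "1950 -11",
                                    "1949 111", "1949 78");
        System.out.println(run(recs)); // {1949=111, 1950=22}
    }
}
```

On the slide's sample pairs this produces exactly the final output shown above: (1949, 111) and (1950, 22).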

MapReduce Example 2


Example 2: WordCount [1/5]
• For text files stored under /usr/joe/wordcount/input, count the number of occurrences of each word
• How do the files and directory look?

$ bin/hadoop dfs -ls /usr/joe/wordcount/input/
/usr/joe/wordcount/input/file01
/usr/joe/wordcount/input/file02

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file01
Hello World, Bye World!

$ bin/hadoop dfs -cat /usr/joe/wordcount/input/file02
Hello Hadoop, Goodbye to hadoop.

Example 2: WordCount [2/5]
• Run the MapReduce application

$ bin/hadoop jar /usr/joe/wordcount.jar org.myorg.WordCount /usr/joe/wordcount/input /usr/joe/wordcount/output

$ bin/hadoop dfs -cat /usr/joe/wordcount/output/part-00000
Bye 1
Goodbye 1
Hadoop, 1
Hello 2
World! 1
World, 1
hadoop. 1
to 1


Example 2: WordCount [3/5]

Mappers:
1. Read a line
2. Tokenize the string
3. Pass the <key, value> output to the reducer

Reducers:
1. Collect <key, value> pairs sharing the same key
2. Aggregate the total number of occurrences

What do you have to pass from the Mappers?

Example 2: WordCount [4/5]

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());
      context.write(word, one);  // emit <word, 1>
    }
  }
}

Example 2: WordCount [5/5]

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    context.write(key, new IntWritable(sum));  // emit <word, total count>
  }
}

Exercise
Design your map and reduce functions to perform the following data processing:

Find the 10 clients who used the most electricity (in kilowatts) in each zip code over the last month.

The files contain information about the last month only. The data is formatted as follows:
{customerID, TAB, address, TAB, zipcode, TAB, electricity usage, LINEFEED}
Assume that each line will be used as the input to a Map function.

Question 1: What are the input/output/functionality of your mapper?
Question 2: What are the input/output/functionality of your reducer?

Answer
• Assume that all the ClientIDs are unique.
(1) Mapper
  Input: <dummy key (e.g., file offset), a line of the input file (customerID, TAB, address, TAB, zipcode, TAB, electricity usage, LINEFEED)>
  Functionality: Tokenize the string, retrieve the zip code, and generate an output
  Output: <zip code, [customer_ID, electricity usage]>
(2) Reducer
  Input: <zip code, a list of [customer_ID, electricity usage]>
  Functionality: Scan the list of values and identify the top 10 customers with the highest electricity usage
  Output: <zip code, a list of customers>

Better Answer: Top-N design pattern
• Assume that all the ClientIDs are unique.
(1) Mapper
  Input: <dummy key (e.g., file offset), a line of the input file (customerID, TAB, address, TAB, zipcode, TAB, electricity usage, LINEFEED)>
  Functionality:
    Create a data structure (HashMap: local_top10) to store the local top-10 information.
    Tokenize the string and retrieve the zip code.
    If this client ranks among the local top 10 so far, update local_top10.
    After the input split is completely scanned, generate output from local_top10.
  Output: <zip code, local_top10>
(2) Reducer
  Input: <zip code, a list of [local_top10]>
  Functionality: Scan the list of values and identify the top 10 customers with the highest electricity usage
  Output: <zip code, a list of customers>
• This approach significantly reduces the communication within your MR cluster
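The Top-N idea can be exercised outside Hadoop with a runnable sketch. Assumptions to keep the toy small: N = 3 instead of 10, a single zip code, and usage values as bare integers (a TreeSet is used, so tied usage values collapse; real code would keep the customer record alongside each value).

```java
import java.util.*;

// Each "mapper" keeps only its local top N usages; the "reducer" merges the
// local lists into the global top N, seeing far fewer values than the input.
public class TopNDemo {
    static List<Integer> localTopN(List<Integer> usages, int n) {
        TreeSet<Integer> top = new TreeSet<>();
        for (int u : usages) {
            top.add(u);
            if (top.size() > n) top.pollFirst();  // drop the current minimum
        }
        return new ArrayList<>(top.descendingSet());  // largest first
    }

    public static void main(String[] args) {
        // two input splits, one per mapper
        List<Integer> split1 = List.of(40, 95, 10, 70);
        List<Integer> split2 = List.of(90, 20, 80, 60);
        List<Integer> merged = new ArrayList<>();
        merged.addAll(localTopN(split1, 3));  // mapper 1 emits [95, 70, 40]
        merged.addAll(localTopN(split2, 3));  // mapper 2 emits [90, 80, 60]
        // the reducer sees only 6 values instead of all 8
        System.out.println(localTopN(merged, 3)); // [95, 90, 80]
    }
}
```

The communication saving is the point: each mapper ships at most N values per zip code to the reducer, no matter how large its input split is.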


Better Answer: Top-N design pattern: More Info
• Structure of the Top-N pattern: each input split is processed by a filter mapper that emits only its local top 10; a single TopTenReducer merges the local top-10 lists and produces the final top-ten output.

Better Answer: Top-N design pattern: More Info


public static class TopTenMapper extends Mapper<Object, Text, Text, Text> {
  // Create a TreeMap for each zip code. You can maintain a HashMap with the zip
  // code as the key and a TreeMap as the value. This example handles only 1 zip code.
  private TreeMap<Integer, Text> localTop10 = new TreeMap<Integer, Text>();

  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    // Your code to extract the zip code and other attributes. If there are
    // multiple zip codes, retrieve the corresponding TreeMap based on the zip code here.

    // Evaluate the current electricity usage: add this value, then remove the lowest
    localTop10.put(Integer.parseInt(electricity_usage), new Text(your_value));
    if (localTop10.size() > 10) {
      localTop10.remove(localTop10.firstKey());
    }
  }

  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Output our ten records to the reducers with the zip code as the key
    for (Text t : localTop10.values()) {
      context.write(zipcode, t);
    }
  }
}

Better Answer: Top-N design pattern: More Info

• A map function can generate 0 or more outputs.
• setup() and cleanup() are called "only once" for each Mapper and Reducer. So, if there are 20 mappers running (10,000 inputs each), setup/cleanup will be called only 20 times.
• Example (the framework's run() method drives this lifecycle):

public void run(Context context) throws IOException, InterruptedException {
  setup(context);
  try {
    while (context.nextKey()) {
      reduce(context.getCurrentKey(), context.getValues(), context);
    }
  } finally {
    cleanup(context);
  }
}

Comparison with other systems
• MPI vs. MapReduce
  • MapReduce tries to collocate the data with the compute node
    • Data access is fast because the data is local!
• Volunteer computing vs. MapReduce
  • SETI@home uses donated CPU time
  • What are the differences between MapReduce and SETI@home?

MapReduce Data Flow


MapReduce data flow with a single reducer: each input split (Split 0, 1, 2) is processed by a map task, whose output is sorted locally; the sorted map outputs are copied to the reducer, merged, and reduced into a single output file (part 0), which is written to HDFS with replication.


MapReduce data flow with multiple reducers: each split is mapped and sorted as before, but the map output is partitioned; each reducer copies its partition from all the mappers, merges it, and writes its own output file (part 0, part 1, …) to HDFS with replication.

Execution overview (from the original Google MapReduce design):
Input files → Map phase → Intermediate files (on local disks) → Reduce phase → Output files

1. The library shards the input files into M pieces (Split 0 … Split 4).
2. It starts up many copies of the user program: one master and many workers.
3. The master assigns map and reduce work to idle workers.
4. A map worker reads the contents of the corresponding input shard, then parses and passes each key-value pair to the Map function.
5. Buffered pairs are written to local disk; their locations are reported to the master, which forwards them to the appropriate reducers.
6. A reduce worker accesses the locations notified by the master and performs the reduce function.
7. The reduce worker writes its results to its output file (Output file 0, Output file 1).
8. When everything completes, the master wakes up the user program.

Data locality optimization
• Hadoop tries to run the map task on a node where the input data resides in HDFS
  • This minimizes usage of cluster bandwidth
• If all replica nodes are running other map tasks
  • The job scheduler will look for a free map slot on a node in the same rack

Data movement in Map tasks


Shuffle
• The process by which the system performs the sort and transfers the map outputs to the reducers as inputs
• MapReduce guarantees that the input to every reducer is sorted by key

Combiner functions
• Minimize the data transferred between map and reduce tasks
• Users can specify a combiner function
  • To be run on the map output
  • Its output replaces the map output


Combiner example (from the previous max-temperature example)
• The first map produced: (1950, 0), (1950, 20), (1950, 10)
• The second map produced: (1950, 25), (1950, 15)
• The reduce function is called with a list of all the values: (1950, [0, 20, 10, 25, 15])
• The output will be: (1950, 25)
• We may express the function as:
  max(0, 20, 10, 25, 15) = max(max(0, 20, 10), max(25, 15)) = max(20, 25) = 25

Combiner function
• Runs a local reduce over the map output
• Reduces the amount of data shuffled between the mappers and the reducers
• A combiner cannot replace the reduce function
  • Why?
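A sketch of why max is combiner-safe while a mean would not be: max is associative and commutative, so combining per-mapper maxima gives the same answer as reducing everything at once, whereas averaging per-mapper averages generally does not. This checks both on the slide's data (the `max`/`mean` helpers are illustrative, not Hadoop API).

```java
import java.util.*;

// Max distributes over a per-mapper pre-aggregation; mean does not.
public class CombinerDemo {
    static int max(List<Integer> xs) { return Collections.max(xs); }

    static double mean(List<Integer> xs) {
        return xs.stream().mapToInt(Integer::intValue).average().orElse(0);
    }

    public static void main(String[] args) {
        List<Integer> map1 = List.of(0, 20, 10);  // first map's 1950 values
        List<Integer> map2 = List.of(25, 15);     // second map's 1950 values
        List<Integer> all  = List.of(0, 20, 10, 25, 15);

        // max(all) == max(max(map1), max(map2)) -> a max combiner is safe
        System.out.println(max(all) == max(List.of(max(map1), max(map2)))); // true

        // mean of all values vs. mean of per-mapper means -> a mean combiner is wrong
        System.out.println(mean(all));                      // 14.0
        System.out.println((mean(map1) + mean(map2)) / 2);  // 15.0
    }
}
```

This is one answer to the "Why?" above: a combiner may run zero, one, or many times, so it is only safe for operations whose partial results can be re-aggregated without changing the final answer.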

Questions?
