
Page 1: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Basics of Cloud Computing – Lecture 5

MapReduce Algorithms

Satish Srirama

Some material adapted from slides by Jimmy Lin, 2008 (licensed under Creative Commons Attribution 3.0 License)

Page 2: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Outline

• Course projects

• MapReduce algorithms

• How to write algorithms


Page 3: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Projects

• Have you selected your project topic?

• This week you have to send me an e-mail

[email protected]

– Subject: “BCCProject:FirstName_LastName:ProjectNo”

– The title of the project

– Description in your words, what you want to achieve

• One paragraph

– Put cc: [email protected]


Page 4: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Projects - continued

• Preliminary proposal of the projects (10 mins)

– Select one of the dates 11.05 or 12.05

• Deliverables

– Present your project in a couple of sentences

– Your progress/problems

– Used technologies

– Show preliminary results and demo, if any


Page 5: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Projects - continued

• Final presentation (20 min each)

– 2 day workshop

– Select one of the dates 25.05 or 26.05 for presentation

• Deliverables

– Manual

– Writing

– 10 min presentation

– Demonstrate your work


Page 6: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

MapReduce recap

• Very simple programming model

– Map (anything) -> key, value

– sort, partition on key

– Reduce (key, value) -> key, value

• Metaphor

– input | map | shuffle | reduce > output

– cat * | awk | sort | uniq -c > file


Page 7: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Hadoop Usage Patterns

• Extract, transform, and load (ETL)

– Perform aggregations, transformations and normalizations on the data (e.g., log files) and load it into an RDBMS / data mart

• Reporting and analytics

– Run ad-hoc queries, analytics and data mining operations on large data

• Data processing pipelines

• Machine learning & Graph algorithms

– Implement machine learning algorithms on huge data sets

– Traverse large graphs and data sets, building models and classifiers


Page 8: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

MapReduce Examples

• Distributed Grep

• Count of URL Access Frequency

• Reverse Web-Link Graph

• Term-Vector per Host

• Inverted Index

• Distributed Sort


Page 9: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

MapReduce Jobs

• Tend to be very short, code-wise

– IdentityReducer is very common

• “Utility” jobs can be composed

• Represent a data flow, more so than a procedure


Page 10: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Count of URL Access Frequency

• Processing web access logs

• Very similar to word count

• Map

– processes logs of web page requests and outputs <URL, 1>

• Reduce

– adds together all values

– emits <URL, total count> pairs (see the sketch below)
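A minimal Hadoop sketch of this job, assuming the requested URL is the first whitespace-separated field of each log line (class names and the field position are illustrative, not part of the original slides):

```java
// Sketch only: assumes the requested URL is the first whitespace-separated
// field of each log line.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class UrlAccessCount {

  public static class UrlMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text url = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\\s+");
      if (fields.length > 0 && !fields[0].isEmpty()) {
        url.set(fields[0]);              // the requested URL
        context.write(url, ONE);         // emit <URL, 1>
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text url, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable c : counts) {
        total += c.get();
      }
      context.write(url, new IntWritable(total));   // emit <URL, total count>
    }
  }
}
```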


Page 11: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Distributed Grep

• Map

– Emits a line if it matches a supplied pattern

• Reduce

– Identity function

– Just copies the supplied intermediate data to the output (see the sketch below)
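A sketch of the mapper, assuming the pattern is passed in through a job configuration property (the property name grep.pattern is illustrative); Hadoop's default Reducer already behaves as the identity, so no custom reducer is needed:

```java
// Sketch only: the regex is read from a job property named "grep.pattern"
// (illustrative name). Matching lines are passed through unchanged.
import java.io.IOException;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class GrepMapper extends Mapper<LongWritable, Text, Text, NullWritable> {
  private Pattern pattern;

  @Override
  protected void setup(Context context) {
    pattern = Pattern.compile(context.getConfiguration().get("grep.pattern", ".*"));
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context)
      throws IOException, InterruptedException {
    if (pattern.matcher(line.toString()).find()) {
      context.write(line, NullWritable.get());   // emit the matching line
    }
  }
}
```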


Page 12: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Reverse Web-Link Graph

• Map

– Outputs a <target, source> pair for each link to a target URL found in a page named source

• Reduce

– Concatenates the list of all source URLs

– Returns <target, list(source)>
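A sketch under the assumption that the input key is the source page URL and the value is its HTML content; the href-extraction regex is deliberately naive and only for illustration:

```java
// Sketch only: input key = source URL, value = page HTML (an assumption
// about the input format). Emits <target, source>, reduces to <target, list(source)>.
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReverseLinkGraph {

  public static class LinkMapper extends Mapper<Text, Text, Text, Text> {
    // Very naive href matcher, sufficient for illustration only.
    private static final Pattern HREF = Pattern.compile("href=\"([^\"]+)\"");

    @Override
    protected void map(Text source, Text html, Context context)
        throws IOException, InterruptedException {
      Matcher m = HREF.matcher(html.toString());
      while (m.find()) {
        context.write(new Text(m.group(1)), source);    // emit <target, source>
      }
    }
  }

  public static class ConcatReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text target, Iterable<Text> sources, Context context)
        throws IOException, InterruptedException {
      StringBuilder list = new StringBuilder();
      for (Text s : sources) {
        if (list.length() > 0) list.append(',');
        list.append(s.toString());
      }
      context.write(target, new Text(list.toString())); // emit <target, list(source)>
    }
  }
}
```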


Page 13: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Sort: Inputs

• A set of files, one value per line.

• Mapper key is file name, line number

• Mapper value is the contents of the line


Page 14: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Sort Algorithm

• Takes advantage of reducer properties:

– (key, value) pairs are processed in order by key;

reducers are themselves ordered

• Mapper: Identity function for value

(k, v) -> (v, _)

• Reducer: Identity function (k’, _) -> (k’, “”)


Page 15: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Sort: The Trick

• (key, value) pairs from mappers are sent to a particular reducer based on hash(key)

• Must pick the hash function for your data such that

– K1 < K2 => hash(K1) < hash(K2) (see the partitioner sketch below)

• Used as a test of Hadoop’s raw speed
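A sketch of such an order-preserving partitioner, assuming Text keys and a crude range split on the first byte (a production job would sample the key distribution instead, as Hadoop's TotalOrderPartitioner does):

```java
// Sketch only: range-partitions Text keys on their first byte so that
// reducer i receives smaller keys than reducer i+1 (order-preserving "hash").
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstByteRangePartitioner extends Partitioner<Text, NullWritable> {
  @Override
  public int getPartition(Text key, NullWritable value, int numPartitions) {
    if (key.getLength() == 0) {
      return 0;
    }
    int firstByte = key.getBytes()[0] & 0xFF;       // 0..255
    return (firstByte * numPartitions) / 256;       // proportional mapping onto reducers
  }
}
```

With such a monotone partitioner, concatenating the (already sorted) reducer outputs in reducer order yields a totally sorted result.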


Page 16: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Inverted Index: Inputs

• A set of files containing lines of text

• Mapper key is file name, line number

• Mapper value is the contents of the line


Page 17: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Inverted Index Algorithm

• Mapper: For each word in (file, words),

map to (word, file)

• Reducer: Identity function


Page 18: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Index MapReduce

• map(pageName, pageText):
    foreach word w in pageText:
        emitIntermediate(w, pageName);
    done

• reduce(word, values):
    foreach pageName in values:
        AddToOutputList(pageName);
    done
    emitFinal(FormattedPageListForWord);
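The same algorithm as a minimal Hadoop sketch, assuming plain text input read by the standard TextInputFormat so that the file name can be taken from the FileSplit (class names are illustrative):

```java
// Sketch only: emits <word, fileName> pairs and concatenates file names per word.
import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class InvertedIndex {

  public static class IndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    private Text fileName;

    @Override
    protected void setup(Context context) {
      fileName = new Text(((FileSplit) context.getInputSplit()).getPath().getName());
    }

    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      for (String word : line.toString().split("\\W+")) {
        if (!word.isEmpty()) {
          context.write(new Text(word), fileName);   // emit <word, file>
        }
      }
    }
  }

  public static class IndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> files, Context context)
        throws IOException, InterruptedException {
      StringBuilder list = new StringBuilder();
      for (Text f : files) {
        if (list.indexOf(f.toString()) < 0) {        // de-duplicate file names
          if (list.length() > 0) list.append(", ");
          list.append(f.toString());
        }
      }
      context.write(word, new Text(list.toString())); // emit <word, list(file)>
    }
  }
}
```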


Page 19: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Index: Data Flow

Page A: “This page contains so much of text”
Page B: “This page too contains some text”

A map output:
This : A
page : A
contains : A
so : A
much : A
of : A
text : A

B map output:
This : B
page : B
too : B
contains : B
some : B
text : B

Reduced output:
This : A, B
page : A, B
contains : A, B
so : A
much : A
of : A
too : B
some : B
text : A, B

Page 20: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Let us focus on much bigger problems


Page 21: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Managing Dependencies

• Remember: Mappers run in isolation

– You have no idea in what order the mappers run

– You have no idea on what node the mappers run

– You have no idea when each mapper finishes

• Tools for synchronization:

– Ability to hold state in reducer across multiple key-value pairs

– Sorting function for keys

– Partitioner

– Cleverly-constructed data structures


Page 22: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Motivating Example

• Term co-occurrence matrix for a text collection

– M = N x N matrix (N = vocabulary size)

– M_ij: number of times i and j co-occur in some context (for concreteness, let’s say context = sentence)

• Why?

– Distributional profiles as a way of measuring semantic distance

– Semantic distance useful for many language processing tasks

“You shall know a word by the company it keeps” (Firth, 1957)

e.g., Mohammad and Hirst (EMNLP, 2006)


Page 23: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

MapReduce: Large Counting Problems

• Term co-occurrence matrix for a text collection = a specific instance of a large counting problem

– A large event space (number of terms)

– A large number of events (the collection itself)

– Goal: keep track of interesting statistics about the events

• Basic approach

– Mappers generate partial counts

– Reducers aggregate partial counts

How do we aggregate partial counts efficiently?


Page 24: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

First Try: “Pairs”

• Each mapper takes a sentence:

– Generate all co-occurring term pairs

– For all pairs, emit (a, b) → count

• Reducers sum up the counts associated with these pairs (see the sketch below)

• Use combiners!
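A minimal sketch of the “pairs” approach, assuming one sentence per input line and encoding a pair simply as the string “a,b” in a Text key (a custom WritableComparable, discussed later, would be the cleaner choice):

```java
// Sketch only: counts co-occurrences within a line ("sentence");
// a pair is encoded as "a,b" in a Text key.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairsCooccurrence {

  public static class PairsMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable offset, Text sentence, Context context)
        throws IOException, InterruptedException {
      String[] terms = sentence.toString().split("\\s+");
      for (String a : terms) {
        for (String b : terms) {
          if (!a.equals(b)) {
            context.write(new Text(a + "," + b), ONE);   // emit ((a, b), 1)
          }
        }
      }
    }
  }

  // Sums partial counts per pair.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text pair, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      context.write(pair, new IntWritable(sum));
    }
  }
}
```

The same reducer class can be registered as the combiner, since summing partial counts is associative and commutative.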


Page 25: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

“Pairs” Analysis

• Advantages

– Easy to implement, easy to understand

• Disadvantages

– Lots of pairs to sort and shuffle around (upper bound?)


Page 26: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Another Try: “Stripes”

• Idea: group together pairs into an associative array

(a, b) → 1

(a, c) → 2

(a, d) → 5

(a, e) → 3

(a, f) → 2

a → { b: 1, c: 2, d: 5, e: 3, f: 2 }

• Each mapper takes a sentence:

– Generate all co-occurring term pairs

– For each term a, emit a → { b: count_b, c: count_c, d: count_d, … }

• Reducers perform element-wise sum of associative arrays (see the sketch below)

   a → { b: 1, d: 5, e: 3 }
+ a → { b: 1, c: 2, d: 2, f: 2 }
= a → { b: 2, c: 2, d: 7, e: 3, f: 2 }
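A sketch of the “stripes” approach using Hadoop's MapWritable as the associative array; per-mapper aggregation across sentences is omitted for brevity:

```java
// Sketch only: each mapper emits one stripe (MapWritable) per term and line;
// the reducer sums stripes element-wise.
import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.MapWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class StripesCooccurrence {

  public static class StripesMapper extends Mapper<LongWritable, Text, Text, MapWritable> {
    @Override
    protected void map(LongWritable offset, Text sentence, Context context)
        throws IOException, InterruptedException {
      String[] terms = sentence.toString().split("\\s+");
      for (String a : terms) {
        MapWritable stripe = new MapWritable();
        for (String b : terms) {
          if (!a.equals(b)) {
            Text neighbour = new Text(b);
            IntWritable count = (IntWritable) stripe.get(neighbour);
            stripe.put(neighbour, new IntWritable(count == null ? 1 : count.get() + 1));
          }
        }
        if (!stripe.isEmpty()) {
          context.write(new Text(a), stripe);            // emit a → { b: count_b, ... }
        }
      }
    }
  }

  // Element-wise sum of stripes.
  public static class StripesReducer extends Reducer<Text, MapWritable, Text, MapWritable> {
    @Override
    protected void reduce(Text term, Iterable<MapWritable> stripes, Context context)
        throws IOException, InterruptedException {
      MapWritable sum = new MapWritable();
      for (MapWritable stripe : stripes) {
        for (Map.Entry<Writable, Writable> e : stripe.entrySet()) {
          IntWritable current = (IntWritable) sum.get(e.getKey());
          int add = ((IntWritable) e.getValue()).get();
          sum.put(e.getKey(), new IntWritable(current == null ? add : current.get() + add));
        }
      }
      context.write(term, sum);
    }
  }
}
```

The reducer can also serve as the combiner, letting partial stripes be merged before the shuffle.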


Page 27: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

“Stripes” Analysis

• Advantages

– Far less sorting and shuffling of key-value pairs

– Can make better use of combiners

• Disadvantages

– More difficult to implement

– Underlying object is more heavyweight

– Fundamental limitation in terms of size of event space


Page 28: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Cluster size: 38 cores

Data Source: Associated Press Worldstream (APW) of the English Gigaword Corpus (v3), which contains 2.27 million documents (1.8 GB compressed, 5.7 GB uncompressed)


Page 29: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Conditional Probabilities

• How do we compute conditional probabilities from counts?

P(B|A) = count(A, B) / count(A) = count(A, B) / Σ_B' count(A, B')

• How do we do this with MapReduce?


Page 30: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

P(B|A): “Pairs”

(a, *) → 32   (the reducer holds this value in memory)

(a, b1) → 3    ⇒  (a, b1) → 3 / 32
(a, b2) → 12   ⇒  (a, b2) → 12 / 32
(a, b3) → 7    ⇒  (a, b3) → 7 / 32
(a, b4) → 1    ⇒  (a, b4) → 1 / 32

• For this to work:

– Must emit extra (a, *) for every b_n in the mapper

– Must make sure all a’s get sent to the same reducer (use a Partitioner)

– Must make sure (a, *) comes first (define the sort order; see the partitioner sketch below)
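Continuing the “a,b”-encoded pairs from the earlier sketch, a partitioner that sends every pair with the same left term to the same reducer might look like the following (the key encoding is an assumption, not the only option). With that encoding the default lexicographic sort on Text already places (a, *) before (a, b…) for ordinary alphabetic terms, since ‘*’ precedes letters.

```java
// Sketch only: partitions "a,b"-encoded pair keys on the left term only,
// so (a, *) and all (a, b) pairs reach the same reducer.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class LeftTermPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text pair, IntWritable count, int numPartitions) {
    String leftTerm = pair.toString().split(",", 2)[0];
    return (leftTerm.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}
```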


Page 31: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

P(B|A): “Stripes”

• Easy!

– One pass to compute (a, *)

a → { b1: 3, b2: 12, b3: 7, b4: 1, … }

– Another pass to directly compute P(B|A)


Page 32: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Synchronization in Hadoop

• Approach 1: turn synchronization into an ordering problem

– Sort keys into correct order of computation

– Partition key space so that each reducer gets the appropriate set of partial results

– Hold state in reducer across multiple key-value pairs to perform computation

– Illustrated by the “pairs” approach

• Approach 2: construct data structures that “bring the pieces together”

– Each reducer receives all the data it needs to complete the computation

– Illustrated by the “stripes” approach


Page 33: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Issues and Tradeoffs

• Number of key-value pairs

– Object creation overhead

– Time for sorting and shuffling pairs across the network

• Size of each key-value pair

– De/serialization overhead

• Combiners make a big difference!

– RAM vs. disk and network

– Arrange data to maximize opportunities to aggregate partial results


Page 34: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Complex Data Types in Hadoop

• How do you implement complex data types?

• The easiest way:

– Encode it as Text, e.g., (a, b) = “a:b”

– Use regular expressions to parse and extract data

– Works, but pretty hack-ish

• The hard way:

– Define a custom implementation of WritableComparable (see the sketch below)

– Must implement: readFields, write, compareTo

– Computationally efficient, but slow for rapid prototyping
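A minimal sketch of such a custom type for a term pair (a, b); hashCode/equals and a byte-level RawComparator, which real implementations usually add for speed, are omitted:

```java
// Sketch only: a WritableComparable holding a pair of terms (a, b).
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

public class TermPair implements WritableComparable<TermPair> {
  private String left = "";
  private String right = "";

  public TermPair() {}                       // required no-arg constructor

  public TermPair(String left, String right) {
    this.left = left;
    this.right = right;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    out.writeUTF(left);                      // serialize both terms
    out.writeUTF(right);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    left = in.readUTF();                     // deserialize in the same order
    right = in.readUTF();
  }

  @Override
  public int compareTo(TermPair other) {     // sort by left term, then right term
    int cmp = left.compareTo(other.left);
    return cmp != 0 ? cmp : right.compareTo(other.right);
  }

  @Override
  public String toString() {
    return "(" + left + ", " + right + ")";
  }
}
```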


Page 35: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

Next Lecture

• Next week there is no lecture (10.05)

• More MapReduce

• Let us look at information retrieval


Page 36: Basics of Cloud Computing –Lecture 5 MapReduce Algorithms

References

• Cloudera – Hadoop training: http://www.cloudera.com/developers/learn-hadoop/training/

• J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters”, OSDI'04: Sixth Symposium on Operating System Design and Implementation, Dec. 2004.

• Todo:

– Projects…
