
Page 1

10-605: Map-Reduce Workflows

William Cohen

1

Page 2

PARALLELIZING STREAM AND SORT

2

Page 3

Stream and Sort Counting → Distributed Counting

[Figure: two Counter Machines each run the counting logic over their stream of examples (example 1, example 2, example 3, …), partition/sort the resulting counter-update messages, and write them to spill files (Spill 1, Spill 2, Spill 3, … Spill n). Two Reducer Machines each merge the spill files for their partition and run the logic to combine counter updates (C[x1] += D1, C[x1] += D2, … on Reducer Machine 1; C[x2] += D1, C[x2] += D2, … on Reducer Machine 2).]

3

Page 4

Distributed Stream-and-Sort: Map, Shuffle-Sort, Reduce

[Figure: the same picture, relabeled with map-reduce terminology. Map Process 1 and Map Process 2 each run the counting logic over their examples, partition/sort the counter-update messages, and write spill files (Spill 1, Spill 2, Spill 3, … Spill n). A distributed shuffle-sort delivers each partition's spills to the right reducer. Reducer 1 and Reducer 2 each merge their spill files and run the logic to combine counter updates (C[x1] += D1, C[x1] += D2, …; C[x2] += D1, C[x2] += D2, …).]

4

Page 5

Combiners in Hadoop

5

Page 6

Some of this is wasteful
• Remember: moving data around and writing to/reading from disk are very expensive operations
• No reducer can start until:
  – all mappers are done
  – data for its partition has been received
  – data in its partition has been sorted

6

Page 7

How much does buffering help?

  BUFFER_SIZE    Time   Message Size
  none           -      1.7M words
  100            47s    1.2M
  1,000          42s    1.0M
  10,000         30s    0.7M
  100,000        16s    0.24M
  1,000,000      13s    0.16M
  limit          -      0.05M

Recall the idea: in stream-and-sort, use a buffer to accumulate counts in messages for common words before the sort, so the sort's input is smaller. (A sketch of the idea follows below.)

7
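A minimal Python sketch of the buffering idea above (illustrative only, not the course's code; the script name and file format are made up): keep a bounded in-memory buffer of partial counts, flush it when it fills, and let the downstream sort see fewer, larger messages.

import sys
from collections import Counter

BUFFER_SIZE = 10000  # cf. the table above: bigger buffers, fewer messages

def emit(word, count):
    # one counter-update message per line, ready to be piped into Unix sort
    sys.stdout.write(f"{word}\t{count}\n")

def stream_counts(lines, buffer_size=BUFFER_SIZE):
    buf = Counter()
    for line in lines:
        for word in line.split():
            buf[word] += 1
            if len(buf) >= buffer_size:   # buffer full: flush the partial counts
                for w, n in buf.items():
                    emit(w, n)
                buf.clear()
    for w, n in buf.items():              # flush whatever is left at end of input
        emit(w, n)

if __name__ == "__main__":
    stream_counts(sys.stdin)

Run as: python buffered_count.py < train.txt | sort | (summing reducer). Locally, the buffer plays the same role that a combiner plays in Hadoop.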

Page 8

Combiners
• Sits between the map and the shuffle
  – Do some of the reducing while you're waiting for other stuff to happen
  – Avoid moving all of that data over the network
  – E.g., for wordcount: instead of sending (word, 1), send (word, n) where n is a partial count over the data seen by that mapper (see the sketch below)
  – The reducer still just sums the counts
• Only applicable when
  – the order of reduce operations doesn't matter (since the order is undetermined)
  – the effect is cumulative

8
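A toy Python simulation of what a combiner buys you for word count (illustration only; in Hadoop the combiner is a Reducer class registered on the Job, as the next slides show):

from collections import Counter
from itertools import groupby

def map_phase(doc):
    return [(w, 1) for w in doc.split()]

def combine(pairs):
    # runs on the mapper's local output: partial sums BEFORE the shuffle
    partial = Counter()
    for word, n in pairs:
        partial[word] += n
    return list(partial.items())          # (word, n) with n a partial count

def reduce_phase(all_pairs):
    # after shuffle-sort: group by word and sum; summing is cumulative and
    # order-independent, so it works the same on raw 1s or on partial sums
    out = {}
    for word, group in groupby(sorted(all_pairs), key=lambda kv: kv[0]):
        out[word] = sum(n for _, n in group)
    return out

docs = ["a rose is a rose", "a aardvark found a rose"]
shuffled = [kv for d in docs for kv in combine(map_phase(d))]
print(reduce_phase(shuffled))  # {'a': 4, 'aardvark': 1, 'found': 1, 'is': 1, 'rose': 3}

With the combiner, each mapper ships one (word, n) pair per distinct word instead of one pair per occurrence.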

Page 9

9

Page 10

// in the Hadoop driver: reuse the Reducer class as the combiner for this job
job.setCombinerClass(Reduce.class);

10

Page 11

Deja vu: Combiner = Reducer
• Often the combiner is the reducer
  – like for word count
  – but not always
  – remember: you have no control over when/whether the combiner is applied

11

Page 12

Algorithms in Map-Reduce

(Workflows)

12

Page 13

Scalable out-of-core classification (of large test sets)

Can we do better than the current approach?

13

Page 14

Testing Large-vocab Naïve Bayes

• For each example id, y, x1,…,xd in train:
  – output the event-counter update "messages"
• Sort the event-counter update "messages"
• Scan and add the sorted messages and output the final counter values
• Initialize a HashSet NEEDED and a hashtable C
• For each example id, y, x1,…,xd in test:
  – Add x1,…,xd to NEEDED
• For each event, C(event) in the summed counters:
  – If the event involves a NEEDED term x, read it into C
• For each example id, y, x1,…,xd in test:
  – For each y' in dom(Y):
    • Compute log Pr(y', x1,…,xd) = ….

[For assignment]

14

Note: the test set is small; the collection of event counts is big.
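A rough Python sketch of the test procedure above (the file format, the event encoding, and the Laplace smoothing are placeholder assumptions for illustration, not the assignment's spec):

import math
from collections import defaultdict

def test_nb(test_examples, summed_counter_lines, classes, vocab_size, total_per_class):
    # Pass 1 over the (small) test set: which terms will we NEED counters for?
    needed = set()
    for _id, words in test_examples:
        needed.update(words)

    # One scan over the (big) stream of summed counters, keeping only NEEDED ones.
    C = defaultdict(int)                       # C[(word, y)] = count
    for line in summed_counter_lines:          # assumed format: "X=w^Y=y<TAB>n"
        event, n = line.rstrip("\n").split("\t")
        w, y = event[2:].split("^Y=")          # strip the leading "X="
        if w in needed:
            C[(w, y)] = int(n)

    # Pass 2 over the test set: classify with the counters now in memory.
    predictions = {}
    for _id, words in test_examples:
        def log_pr(y):                         # log Pr(y, x1..xd); class prior omitted
            return sum(math.log((C[(w, y)] + 1.0) / (total_per_class[y] + vocab_size))
                       for w in words)
        predictions[_id] = max(classes, key=log_pr)
    return predictions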

Page 15

Can we do better?

Test data:
  id1  w1,1 w1,2 w1,3 …. w1,k1
  id2  w2,1 w2,2 w2,3 ….
  id3  w3,1 w3,2 ….
  id4  w4,1 w4,2 …
  id5  w5,1 w5,2 …
  ...

Event counts:
  X=w1^Y=sports      5245
  X=w1^Y=worldNews   1054
  X=..               2120
  X=w2^Y=…           373
  X=…                …

What we'd like:
  id1  w1,1 w1,2 w1,3 …. w1,k1   C[X=w1,1^Y=sports]=5245, C[X=w1,1^Y=..], C[X=w1,2^…], …
  id2  w2,1 w2,2 w2,3 ….         C[X=w2,1^Y=….]=1054, …, C[X=w2,k2^…]
  id3  w3,1 w3,2 ….              C[X=w3,1^Y=….]=…
  id4  w4,1 w4,2 …               …

15

Page 16

Can we do better?

Event counts:
  X=w1^Y=sports      5245
  X=w1^Y=worldNews   1054
  X=..               2120
  X=w2^Y=…           373
  X=…                …

Step 1: group the counters by word w.
How: stream and sort:
• for each C[X=w^Y=y]=n, print "w C[Y=y]=n"
• sort, and build a list of the values associated with each key w
Like an inverted index.

Result, one record per word:
  w          Counts associated with W
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=564
  …          …
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=4464

16
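Step 1 as a stream-and-sort pipeline, sketched in Python (illustrative; the course does this with small Java programs piped through Unix sort):

import sys
from itertools import groupby

def emit_by_word(counter_lines):
    # input lines like "X=w^Y=y<TAB>n"; output lines like "w<TAB>C[Y=y]=n"
    for line in counter_lines:
        event, n = line.rstrip("\n").split("\t")
        w, y = event[2:].split("^Y=")
        print(f"{w}\tC[Y={y}]={n}")

def collect_records(sorted_lines):
    # after sort, all messages for a word are adjacent: build one record per word
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for w, group in groupby(pairs, key=lambda p: p[0]):
        print(w + "\t" + ",".join(c for _, c in group))

if __name__ == "__main__":
    stage = sys.argv[1] if len(sys.argv) > 1 else "emit"
    (emit_by_word if stage == "emit" else collect_records)(sys.stdin)

Pipeline shape: python group_by_word.py emit < eventCounts.dat | sort | python group_by_word.py collect > words.dat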

Page 17

If these records were in a key-value DB we would know what to do….

Test data:
  id1  w1,1 w1,2 w1,3 …. w1,k1
  id2  w2,1 w2,2 w2,3 ….
  id3  w3,1 w3,2 ….
  id4  w4,1 w4,2 …
  id5  w5,1 w5,2 …
  ...

Record of all event counts for each word:
  w          Counts associated with W
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=…
  …          …
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=…

Step 2 (classification logic): stream through the test data and, for each test case
  idi  wi,1 wi,2 wi,3 …. wi,ki
request the event counters needed to classify idi from the event-count DB, then classify using the answers.

17

Page 18

Is there a stream-and-sort analog of this request-and-answer pattern?

(Same test data, word-indexed event-count records, and Step 2 classification logic as the previous slide.)

18

Page 19

Recall: Stream and Sort Counting: sort messages so the recipient can stream through them

[Figure: Machine A runs the counting logic over the examples (example 1, example 2, example 3, …) and emits messages of the form "C[x] += D"; Machine B sorts the messages; Machine C streams through the sorted messages (C[x1] += D1, C[x1] += D2, …) and runs the logic to combine counter updates.]

19

Page 20

Is there a stream-and-sort analog of this request-and-answer pattern?

(Test data and word-indexed event-count records as before.)

The classification logic emits one request per test word:
  W1,1  counters to id1
  W1,2  counters to id1
  …
  Wi,j  counters to idi

20

Page 21

Is there a stream-and-sort analog of this request-and-answer pattern?

Test data (now concrete):
  id1  found an aardvark in zynga's farmville today!
  id2  …
  id3  …
  id4  …
  id5  …

(Word-indexed event-count records as before.)

The classification logic emits the requests:
  found     ctrs to id1
  aardvark  ctrs to id1
  …
  today     ctrs to id1

21

Page 22

Is there a stream-and-sort analog of this request-and-answer pattern?

(Test data and word-indexed event-count records as before.)

The classification logic emits the requests, prefixed with "~":
  found     ~ctrs to id1
  aardvark  ~ctrs to id1
  …
  today     ~ctrs to id1

~ is the last printable ASCII character, so with

  % export LC_COLLATE=C

it sorts after anything else with Unix sort: within each word's group of lines, the counter record comes first and the requests for that word follow it.

22

Page 23

Is there a stream-and-sort analog of this request-and-answer pattern?

Test data:
  id1  found an aardvark in zynga's farmville today!
  id2  …
  id3  …
  …

Counter records (the record of all event counts for each word, as before) plus the requests:
  found     ~ctr to id1
  aardvark  ~ctr to id2
  …
  today     ~ctr to idi

Combine and sort.

23

Page 24

A stream-and-sort analog of the request-and-answer pattern…

The counter records and the requests (found ~ctr to id1, aardvark ~ctr to id1, …, today ~ctr to id1) are combined and sorted. The result interleaves each word's counter record with the requests for that word:

  w          Counts / requests
  aardvark   C[w^Y=sports]=2
  aardvark   ~ctr to id1
  agent      C[w^Y=sports]=…
  agent      ~ctr to id345
  agent      ~ctr to id9854
  agent      ~ctr to id34742
  …
  zynga      C[…]
  zynga      ~ctr to id1

Request-handling logic runs over this stream.

24

Page 25

A stream-and-sort analog of the request-and-answer pattern…

(Combined, sorted stream of counter records and ~ctr requests, as on the previous slide.)

Request-handling logic:
  previousKey = somethingImpossible
  for each (key, val) in input:
    if key == previousKey:
      Answer(recordForPrevKey, val)
    else:
      previousKey = key
      recordForPrevKey = val

  define Answer(record, request):
    find id where request = "~ctr to id"
    print "id ~ctr for request is record"

25
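The request-handling logic above, transcribed into runnable Python (the tab-separated line format is an assumption; the key point is that, thanks to LC_COLLATE=C and the ~ prefix, each word's counter record arrives before the requests for it):

import sys

def answer(key, record, request):
    # request looks like "~ctr to id1"; echo the record back to that id
    req_id = request.split()[-1]
    print(f"{req_id}\t~ctr for {key} is {record}")

def handle_requests(lines):
    previous_key, record_for_prev_key = None, None   # "somethingImpossible"
    for line in lines:
        key, val = line.rstrip("\n").split("\t", 1)
        if key == previous_key:
            answer(key, record_for_prev_key, val)            # val is a request
        else:
            previous_key, record_for_prev_key = key, val     # val is the counter record

if __name__ == "__main__":
    handle_requests(sys.stdin)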

Page 26

A stream-and-sort analog of the request-and-answer pattern…

(Same combined-and-sorted stream and request-handling logic as the previous slide.)

Output:
  id1  ~ctr for aardvark is C[w^Y=sports]=2
  …
  id1  ~ctr for zynga is ….
  …

26

Page 27

A stream-and-sort analog of the request-and-answer pattern…

The request-handling output:
  id1  ~ctr for aardvark is C[w^Y=sports]=2
  …
  id1  ~ctr for zynga is ….
  …

is then combined and sorted with the test data:
  id1  found an aardvark in zynga's farmville today!
  id2  …
  id3  …
  …

Combine and sort ????

27

Page 28

What we'd wanted:
  id1  w1,1 w1,2 w1,3 …. w1,k1   C[X=w1,1^Y=sports]=5245, C[X=w1,1^Y=..], C[X=w1,2^…], …
  id2  w2,1 w2,2 w2,3 ….         C[X=w2,1^Y=….]=1054, …, C[X=w2,k2^…]
  id3  w3,1 w3,2 ….              C[X=w3,1^Y=….]=…
  id4  w4,1 w4,2 …               …

What we ended up with … and it's good enough!
  Key   Value
  id1   found aardvark zynga farmville today
        ~ctr for aardvark is C[w^Y=sports]=2
        ~ctr for found is C[w^Y=sports]=1027, C[w^Y=worldNews]=564
        …
  id2   w2,1 w2,2 w2,3 ….
        ~ctr for w2,1 is …
  …     …

28

Page 29

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests

train.dat:
  id1  w1,1 w1,2 w1,3 …. w1,k1
  id2  w2,1 w2,2 w2,3 ….
  id3  w3,1 w3,2 ….
  id4  w4,1 w4,2 …
  id5  w5,1 w5,2 …
  ...

counts.dat:
  X=w1^Y=sports      5245
  X=w1^Y=worldNews   1054
  X=..               2120
  X=w2^Y=…           373
  X=…                …

29

Page 30

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests

words.dat:
  w          Counts associated with W
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=564
  …          …
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=4464

30

Page 31

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests

Input to the sort (the requests from requestWordCounts, concatenated with words.dat) looks like this:
  found     ~ctr to id1
  aardvark  ~ctr to id2
  …
  today     ~ctr to idi

  w          Counts
  aardvark   C[w^Y=sports]=2
  agent      …
  …
  zynga      …

Output of the sort (the input to answerWordCountRequests) looks like this:
  w          Counts / requests
  aardvark   C[w^Y=sports]=2
  aardvark   ~ctr to id1
  agent      C[w^Y=sports]=…
  agent      ~ctr to id345
  agent      ~ctr to id9854
  agent      ~ctr to id34742
  …

31

Page 32

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests

Output of answerWordCountRequests looks like this:
  id1  ~ctr for aardvark is C[w^Y=sports]=2
  …
  id1  ~ctr for zynga is ….
  …

test.dat:
  id1  found an aardvark in zynga's farmville today!
  id2  …
  id3  …
  …

32

Page 33

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests

Input to testNBUsingRequests looks like this:
  Key   Value
  id1   found aardvark zynga farmville today
        ~ctr for aardvark is C[w^Y=sports]=2
        ~ctr for found is C[w^Y=sports]=1027, C[w^Y=worldNews]=564
        …
  id2   w2,1 w2,2 w2,3 ….
        ~ctr for w2,1 is …
  …     …

33

Page 34

ABSTRACTIONS FOR MAP-REDUCE

34

Page 35

Abstractions On Top Of Map-Reduce
• Some obvious streaming processes:
  – for each row in a table
    • Transform it and output the result
    • Decide if you want to keep it with some boolean test, and copy out only the ones that pass the test

Example: stem words in a stream of word-count pairs:
  ("aardvarks", 1) → ("aardvark", 1)

Proposed syntax:
  table2 = MAP table1 TO λ row : f(row)
  where f(row) → row'

Example: apply stopwords:
  ("aardvark", 1) → ("aardvark", 1)
  ("the", 1) → deleted

Proposed syntax:
  table2 = FILTER table1 BY λ row : f(row)
  where f(row) → {true, false}

35

Page 36

Abstractions On Top Of Map-Reduce
• A non-obvious(?) streaming process:
  – for each row in a table
    • Transform it to a list of items
    • Splice all the lists together to get the output table ("flatten")

Example: tokenizing a line:
  "I found an aardvark" → ["i", "found", "an", "aardvark"]
  "We love zymurgy"     → ["we", "love", "zymurgy"]
  …but the final table is one word per row:
  "i"
  "found"
  "an"
  "aardvark"
  "we"
  "love"
  …

Proposed syntax:
  table2 = FLATMAP table1 TO λ row : f(row)
  where f(row) → list of rows

36
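Rough Python analogues of the proposed MAP, FILTER, and FLATMAP operators (the names, lambdas, and crude stemming rule are only illustrations of the abstraction, not a real API):

STOPWORDS = {"the", "an", "a", "i"}

def MAP(table, f):        # f(row) -> row'
    return [f(row) for row in table]

def FILTER(table, f):     # f(row) -> True/False
    return [row for row in table if f(row)]

def FLATMAP(table, f):    # f(row) -> list of rows; splice the lists together
    return [out for row in table for out in f(row)]

pairs   = [("aardvarks", 1), ("the", 1), ("found", 1)]
stemmed = MAP(pairs, lambda wc: (wc[0].rstrip("s"), wc[1]))      # crude "stemming"
kept    = FILTER(stemmed, lambda wc: wc[0] not in STOPWORDS)      # drop stopwords
words   = FLATMAP(["I found an aardvark", "We love zymurgy"],
                  lambda line: line.lower().split())              # tokenize + flatten
print(kept)    # [('aardvark', 1), ('found', 1)]
print(words)   # ['i', 'found', 'an', 'aardvark', 'we', 'love', 'zymurgy']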

Page 37

NB Test Step

Event counts:
  X=w1^Y=sports      5245
  X=w1^Y=worldNews   1054
  X=..               2120
  X=w2^Y=…           373
  …

How: stream and sort:
• for each C[X=w^Y=y]=n, print "w C[Y=y]=n"
• sort, and build a list of the values associated with each key w
Like an inverted index.

  w          Counts associated with W
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=564
  …          …
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=4464

37

Page 38

NB Test Step

(Event counts and the word-indexed records, as on the previous slide.)

The general case: we're taking rows from a table
• in a particular format (event, count),
applying a function to get a new value
• the word for the event,
and grouping the rows of the table by this new value.
→ A grouping operation: a special case of a map-reduce.

Proposed syntax:
  GROUP table BY λ row : f(row)
  where f(row) → field

(Could define f via a function, a field of a defined record structure, …)

38

Page 39

NB Test Step

(Same grouping view and proposed GROUP … BY syntax as the previous slide.)

Aside: you guys know how to implement this, right? (A sketch follows below.)

1. Output pairs (f(row), row) with a map/streaming process.
2. Sort the pairs by key, which is f(row).
3. Reduce: aggregate by appending together all the values associated with the same key.

39
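The three steps above, as a small Python sketch (an in-memory sort stands in for the distributed shuffle-sort):

from itertools import groupby

def GROUP_BY(table, f):
    keyed = [(f(row), row) for row in table]               # 1. output (f(row), row) pairs
    keyed.sort(key=lambda kv: kv[0])                        # 2. sort the pairs by key
    return {k: [row for _, row in grp]                      # 3. append values per key
            for k, grp in groupby(keyed, key=lambda kv: kv[0])}

# group (event, count) rows by the word inside the event, as in the NB test step
events = [("X=aardvark^Y=sports", 2), ("X=agent^Y=sports", 1027),
          ("X=agent^Y=worldNews", 564), ("X=zynga^Y=sports", 21)]
word_of = lambda row: row[0].split("^")[0][2:]              # "X=agent^Y=sports" -> "agent"
print(GROUP_BY(events, word_of))
# {'aardvark': [...one row...], 'agent': [...two rows...], 'zynga': [...one row...]}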

Page 40

Abstractions On Top Of Map-Reduce
• And another example from the Naïve Bayes test program…

40

Page 41

Request-and-answer

(Test data and word-indexed event-count records as before.)

Step 2 (classification logic): stream through the test data and, for each test case
  idi  wi,1 wi,2 wi,3 …. wi,ki
request the event counters needed to classify idi from the event-count DB, then classify using the answers.

41

Page 42

Request-and-answer
• Break it down into stages:
  – Generate the data being requested (indexed by key, here a word)
    • e.g. with GROUP … BY
  – Generate the requests as (key, requestor) pairs
    • e.g. with FLATMAP … TO
  – Join these two tables by key
    • Join, conceptually: (1) take the cross-product and (2) filter out pairs with different values for the keys
    • Join, as implemented: concatenate the two tables of key-value pairs and reduce them together
  – Postprocess the joined result

42

Page 43

Counter records (from GROUP … BY):
  w          Counters
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=564
  …          …
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=4464

Requests (from FLATMAP … TO):
  w          Request
  found      ~ctr to id1
  aardvark   ~ctr to id1
  …
  zynga      ~ctr to id1
  …          ~ctr to id2

Joined by w:
  w          Counters            Requests
  aardvark   C[w^Y=sports]=2     ~ctr to id1
  agent      C[w^Y=sports]=…     ~ctr to id345
  agent      C[w^Y=sports]=…     ~ctr to id9854
  agent      C[w^Y=sports]=…     ~ctr to id345
  …          C[w^Y=sports]=…     ~ctr to id34742
  zynga      C[…]                ~ctr to id1
  zynga      C[…]                …

43

Page 44

Counter records:
  w          Counters
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=564
  …          …
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=4464

Requests:
  w          Request
  found      id1
  aardvark   id1
  …
  zynga      id1
  …          id2

Joined by w:
  w          Counters            Requests
  aardvark   C[w^Y=sports]=2     id1
  agent      C[w^Y=sports]=…     id345
  agent      C[w^Y=sports]=…     id9854
  agent      C[w^Y=sports]=…     id345
  …          C[w^Y=sports]=…     id34742
  zynga      C[…]                id1
  zynga      C[…]                …

Proposed syntax:
  JOIN table1 BY λ row : f(row), table2 BY λ row : g(row)

Examples:
  JOIN wordInDoc BY word, wordCounters BY word
    (if word(row) is defined correctly)
  JOIN wordInDoc BY lambda (word, docid): word,
       wordCounters BY lambda (word, counters): word
    (using Python syntax for functions)

44
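A rough Python analogue of the proposed JOIN operator, in the same spirit as the MAP / FILTER / FLATMAP / GROUP sketches earlier. It uses the conceptual cross-product-then-filter definition from the "Request-and-answer" slide; at scale the join would instead be implemented by concatenating the two keyed tables and reducing them together.

def JOIN(table1, f, table2, g):
    # conceptually: cross-product, then keep only pairs whose keys match
    return [(row1, row2) for row1 in table1 for row2 in table2 if f(row1) == g(row2)]

wordInDoc    = [("found", "id1"), ("aardvark", "id1"), ("zynga", "id1")]
wordCounters = [("found", "C[w^Y=sports]=1027"), ("aardvark", "C[w^Y=sports]=2")]

joined = JOIN(wordInDoc,    lambda row: row[0],      # JOIN wordInDoc BY word,
              wordCounters, lambda row: row[0])      #      wordCounters BY word
print(joined)
# [(('found', 'id1'), ('found', 'C[w^Y=sports]=1027')),
#  (('aardvark', 'id1'), ('aardvark', 'C[w^Y=sports]=2'))]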

Page 45

Abstract Implementation: [TF]IDF (1/2)

data = pairs (docid, term) where term is a word that appears in the document with id docid

Operators:
• DISTINCT, MAP, JOIN
• GROUP BY …. [RETAINING …] REDUCING TO … (the REDUCING TO part is a reduce step)

docFreq = DISTINCT data
  | GROUP BY λ(docid,term): term REDUCING TO count            /* (term, df) */

docIds  = MAP data BY λ(docid,term): docid | DISTINCT
numDocs = GROUP docIds BY λ docid: 1 REDUCING TO count        /* (1, numDocs) */

dataPlusDF = JOIN data BY λ(docid,term): term, docFreq BY λ(term,df): term
  | MAP λ((docid,term),(term,df)): (docid, term, df)          /* (docid, term, document-freq) */

unnormalizedDocVecs = JOIN dataPlusDF BY λ row: 1, numDocs BY λ row: 1
  | MAP λ((docid,term,df),(dummy,numDocs)): (docid, term, log(numDocs/df))
                                      /* (docid, term, weight-before-normalizing) : u */

Examples of the intermediate tables:

  data:                    docId  term
                           d123   found
                           d123   aardvark

  grouped for docFreq:     key       value
                           found     (d123,found), (d134,found), …
                           aardvark  (d123,aardvark), …

  numDocs:                 key  value
                           1    12451

  docFreq:                 key       value
                           found     2456
                           aardvark  7

45

question – how many reducers should I use here?

Page 46

Abstract Implementation: [TF]IDF (1/2, continued)

(Same pipeline and intermediate tables as the previous slide; the annotation "question – how many reducers should I use here?" is repeated on this slide.)

46

Page 47

Abstract Implementation: TFIDF (2/2)

normalizers = GROUP unnormalizedDocVecs BY λ(docid,term,w): docid
                RETAINING λ(docid,term,w): w²
                REDUCING TO sum                     /* (docid, sum-of-square-weights) */

docVec = JOIN unnormalizedDocVecs BY λ(docid,term,w): docid,
              normalizers BY λ(docid,norm): docid
  | MAP λ((docid,term,w),(docid,norm)): (docid, term, w/sqrt(norm))
                                                    /* (docid, term, weight) */

Examples of the intermediate tables:

  unnormalizedDocVecs:     docId  term      w
                           d1234  found     1.542
                           d1234  aardvark  13.23

  grouped, then reduced:   key    value
                           d1234  (d1234,found,1.542), (d1234,aardvark,13.23), …   37.234
                           d3214  ….                                               29.654

  normalizers:             docId  norm
                           d1234  37.234
                           d3214  29.654

47
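The same dataflow written out as plain Python over small in-memory lists (a sketch of what the abstract pipeline computes, not of how Hadoop would run it):

import math
from collections import defaultdict

# (docid, term) pairs: term is a word appearing in the document with id docid
data = [("d123", "found"), ("d123", "aardvark"),
        ("d134", "found"), ("d134", "zymurgy")]

distinct = set(data)
doc_freq = defaultdict(int)                # GROUP BY term REDUCING TO count -> (term, df)
for docid, term in distinct:
    doc_freq[term] += 1
num_docs = len({docid for docid, _ in distinct})                      # (1, numDocs)

unnormalized = [(docid, term, math.log(num_docs / doc_freq[term]))    # JOIN + MAP
                for docid, term in distinct]                          # (docid, term, u)

normalizers = defaultdict(float)           # GROUP BY docid RETAINING w^2 REDUCING TO sum
for docid, term, w in unnormalized:
    normalizers[docid] += w * w

doc_vec = [(docid, term, w / math.sqrt(normalizers[docid]))           # JOIN + MAP
           for docid, term, w in unnormalized
           if normalizers[docid] > 0]      # guard: skip docs whose weights are all zero
print(doc_vec)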

Page 48

Two ways to join

• Reduce-side join
• Map-side join

48

Page 49

Two ways to join
• Reduce-side join for A, B

  A:   term      df
       found     2456
       aardvark  7
       …

  B:   term      docId
       aardvark  d15
       …         ...
       found     d7
       found     d23
       found     …
       …

Concat and sort, then do the join:
  A                B
  (aardvark, 7)    (aardvark, d15)
  …                ...
  (found, 2456)    (found, d7)
  (found, 2456)    (found, d23)
  …                …

49

Page 50

Two ways to join
• Reduce-side join for A, B

Tag each tuple with the relation it came from:
  A:   term         df          B:   term         docId
       found A      2456             aardvark B   d15
       aardvark A   7                …            ...
       …                             found B      d7
                                     found B      d23
                                     found B      …
                                     …

Concat and sort, then do the join:
  A                B
  (aardvark, 7)    (aardvark, d15)
  …                ...
  (found, 2456)    (found, d7)
  (found, 2456)    (found, d23)
  …                …

Tricky bit: we need to sort by the first two values (aardvark, AB), because we want the DF's to come first; but all tuples with key "aardvark" should still go to the same worker.

50

Page 51

Two ways to join
• Reduce-side join for A, B

(Same tagged relations and concat-and-sort picture as the previous slide.)

Tricky bit: we need to sort by the first two values (aardvark, AB), because we want the DF's to come first; but all tuples with key "aardvark" should still go to the same worker. In Hadoop:
  – custom sort (secondary sort key): a Writable with your own Comparator
  – custom Partitioner (specified for the job like the Mapper, Reducer, …)

(A stream-and-sort sketch of the reduce-side join follows below.)
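A stream-and-sort sketch of the reduce-side join in Python (the A/B tags and the sort on (term, tag) mimic the secondary-sort-plus-custom-Partitioner trick described above):

from itertools import groupby

A = [("found", 2456), ("aardvark", 7)]
B = [("aardvark", "d15"), ("found", "d7"), ("found", "d23")]

tagged = [(term, "A", df) for term, df in A] + [(term, "B", doc) for term, doc in B]
tagged.sort(key=lambda t: (t[0], t[1]))        # "A" < "B": each term's df row leads its group

joined = []
for term, group in groupby(tagged, key=lambda t: t[0]):
    df = None
    for _, tag, val in group:                  # the first row of the group carries the df
        if tag == "A":
            df = val
        else:                                  # every B row joins with the remembered df
            joined.append((term, df, val))
print(joined)   # [('aardvark', 7, 'd15'), ('found', 2456, 'd7'), ('found', 2456, 'd23')]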

Page 52

Two ways to join
• Map-side join
  – write the smaller relation out to disk
  – send it to each Map worker
    • DistributedCache
  – when you initialize each Mapper, load in the small relation
    • Configure(…) is called at initialization time
  – map through the larger relation and do the join
  – faster, but requires one relation to fit in memory (see the sketch below)

52
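A minimal Python sketch of the map-side join idea (illustrative; in Hadoop the small relation would arrive via the DistributedCache and be loaded when the Mapper is initialized):

def map_side_join(small_A, large_B_rows):
    # "load into mapper": the small relation as an in-memory dict, term -> df
    df_of = dict(small_A)
    # "join as you map": one pass over the big relation, no shuffle needed
    for term, doc_id in large_B_rows:
        if term in df_of:
            yield (term, df_of[term], doc_id)

A = [("found", 2456), ("aardvark", 7)]                        # small: fits in memory
B = [("aardvark", "d15"), ("found", "d7"), ("found", "d23")]  # large: streamed
print(list(map_side_join(A, B)))
# [('aardvark', 7, 'd15'), ('found', 2456, 'd7'), ('found', 2456, 'd23')]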

Page 53

Two ways to join
• Map-side join for A (small) and B (large)

  A (load into each mapper):          term      df
                                      found     2456
                                      aardvark  7

  B (streamed through the mappers):   term      docId
                                      aardvark  d15
                                      …         ...
                                      found     d7
                                      found     d23
                                      found     …
                                      …

Join as you map:
  A                B
  (aardvark, 7)    (aardvark, d15)
  …                ...
  (found, 2456)    (found, d7)
  (found, 2456)    (found, d23)
  …                …

53

Page 54

Two ways to join
• Map-side join for A (small) and B (large)

A is duplicated and loaded into every mapper; B is split into B1, B2, … across the mappers:

  A:   term      df       B1:  term      docId      B2:  term   docId
       found     2456           aardvark  d15             found  d7
       aardvark  7              …         ...             found  d23
                                                          found  …
                                                          …

Join as you map:
  A                B
  (aardvark, 7)    (aardvark, d15)
  …                ...
  (found, 2456)    (found, d7)
  (found, 2456)    (found, d23)
  …                …

54

Page 55

PIG: A WORKFLOW LANGUAGE

55

Page 56

PIG: word count example

A PIG program is a bunch of assignments where every LHS is a relation. No loops, conditionals, etc. are allowed.

56

Page 57

TOKENIZE – a built-in function.

FLATTEN – a special keyword, which applies to the next step in the process, so the output is a stream of words without document boundaries.

Built-in regex matching ….

57

Page 58

Group by … foreach generate count(…) will be optimized into a single map-reduce.

Group produces a stream of bags of identical words… bags, tuples, and dictionaries are primitive types.

58