
Page 1

10-605: Map-Reduce Workflows

William Cohen

1

Page 2

PARALLELIZING STREAM AND SORT

2

Page 3

Stream and Sort Counting → Distributed Counting

[Figure: two Counter Machines each run the counting logic over their stream of examples (example 1, example 2, example 3, …), partition/sort the resulting counter-update messages, and write them to spill files (Spill 1, Spill 2, Spill 3, … Spill n). Two Reducer Machines each merge the spill files for their partition and run the logic to combine counter updates (C[x1] += D1, C[x1] += D2, … on Reducer Machine 1; C[x2] += D1, C[x2] += D2, … on Reducer Machine 2).]

3

Page 4

Distributed Stream-and-Sort: Map, Shuffle-Sort, Reduce

[Figure: the same picture, relabeled with map-reduce terminology. Map Process 1 and Map Process 2 each run the counting logic over their examples, partition/sort the counter-update messages, and write spill files (Spill 1, Spill 2, Spill 3, … Spill n). A distributed shuffle-sort delivers each partition's spills to the right reducer. Reducer 1 and Reducer 2 each merge their spill files and run the logic to combine counter updates (C[x1] += D1, C[x1] += D2, …; C[x2] += D1, C[x2] += D2, …).]

4

Page 5

Combiners in Hadoop

5

Page 6

Some of this is wasteful
• Remember: moving data around and writing to/reading from disk are very expensive operations
• No reducer can start until:
  – all mappers are done
  – data for its partition has been received
  – data in its partition has been sorted

6

Page 7

How much does buffering help?

  BUFFER_SIZE    Time   Message Size
  none           -      1.7M words
  100            47s    1.2M
  1,000          42s    1.0M
  10,000         30s    0.7M
  100,000        16s    0.24M
  1,000,000      13s    0.16M
  limit          -      0.05M

Recall the idea: in stream-and-sort, use a buffer to accumulate counts in messages for common words before the sort, so the sort's input is smaller. (A sketch of the idea follows below.)

7
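A minimal Python sketch of the buffering idea above (illustrative only, not the course's code; the script name and file format are made up): keep a bounded in-memory buffer of partial counts, flush it when it fills, and let the downstream sort see fewer, larger messages.

import sys
from collections import Counter

BUFFER_SIZE = 10000  # cf. the table above: bigger buffers, fewer messages

def emit(word, count):
    # one counter-update message per line, ready to be piped into Unix sort
    sys.stdout.write(f"{word}\t{count}\n")

def stream_counts(lines, buffer_size=BUFFER_SIZE):
    buf = Counter()
    for line in lines:
        for word in line.split():
            buf[word] += 1
            if len(buf) >= buffer_size:   # buffer full: flush the partial counts
                for w, n in buf.items():
                    emit(w, n)
                buf.clear()
    for w, n in buf.items():              # flush whatever is left at end of input
        emit(w, n)

if __name__ == "__main__":
    stream_counts(sys.stdin)

Run as: python buffered_count.py < train.txt | sort | (summing reducer). Locally, the buffer plays the same role that a combiner plays in Hadoop.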

Page 8

Combiners
• Sits between the map and the shuffle
  – Do some of the reducing while you're waiting for other stuff to happen
  – Avoid moving all of that data over the network
  – E.g., for wordcount: instead of sending (word, 1), send (word, n) where n is a partial count over the data seen by that mapper (see the sketch below)
  – The reducer still just sums the counts
• Only applicable when
  – the order of reduce operations doesn't matter (since the order is undetermined)
  – the effect is cumulative

8
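A toy Python simulation of what a combiner buys you for word count (illustration only; in Hadoop the combiner is a Reducer class registered on the Job, as the next slides show):

from collections import Counter
from itertools import groupby

def map_phase(doc):
    return [(w, 1) for w in doc.split()]

def combine(pairs):
    # runs on the mapper's local output: partial sums BEFORE the shuffle
    partial = Counter()
    for word, n in pairs:
        partial[word] += n
    return list(partial.items())          # (word, n) with n a partial count

def reduce_phase(all_pairs):
    # after shuffle-sort: group by word and sum; summing is cumulative and
    # order-independent, so it works the same on raw 1s or on partial sums
    out = {}
    for word, group in groupby(sorted(all_pairs), key=lambda kv: kv[0]):
        out[word] = sum(n for _, n in group)
    return out

docs = ["a rose is a rose", "a aardvark found a rose"]
shuffled = [kv for d in docs for kv in combine(map_phase(d))]
print(reduce_phase(shuffled))  # {'a': 4, 'aardvark': 1, 'found': 1, 'is': 1, 'rose': 3}

With the combiner, each mapper ships one (word, n) pair per distinct word instead of one pair per occurrence.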

Page 9

9

Page 10

// in the Hadoop driver: reuse the Reducer class as the combiner for this job
job.setCombinerClass(Reduce.class);

10

Page 11

Deja vu: Combiner = Reducer
• Often the combiner is the reducer
  – like for word count
  – but not always
  – remember: you have no control over when/whether the combiner is applied

11

Page 12

Algorithms in Map-Reduce

(Workflows)

12

Page 13

Scalable out-of-core classification (of large test sets)

Can we do better than the current approach?

13

Page 14

Testing Large-vocab Naïve Bayes

• For each example id, y, x1,…,xd in train:
  – output the event-counter update "messages"
• Sort the event-counter update "messages"
• Scan and add the sorted messages and output the final counter values
• Initialize a HashSet NEEDED and a hashtable C
• For each example id, y, x1,…,xd in test:
  – Add x1,…,xd to NEEDED
• For each event, C(event) in the summed counters:
  – If the event involves a NEEDED term x, read it into C
• For each example id, y, x1,…,xd in test:
  – For each y' in dom(Y):
    • Compute log Pr(y', x1,…,xd) = ….

[For assignment]

14

Note: the test set is small; the collection of event counts is big.
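A rough Python sketch of the test procedure above (the file format, the event encoding, and the Laplace smoothing are placeholder assumptions for illustration, not the assignment's spec):

import math
from collections import defaultdict

def test_nb(test_examples, summed_counter_lines, classes, vocab_size, total_per_class):
    # Pass 1 over the (small) test set: which terms will we NEED counters for?
    needed = set()
    for _id, words in test_examples:
        needed.update(words)

    # One scan over the (big) stream of summed counters, keeping only NEEDED ones.
    C = defaultdict(int)                       # C[(word, y)] = count
    for line in summed_counter_lines:          # assumed format: "X=w^Y=y<TAB>n"
        event, n = line.rstrip("\n").split("\t")
        w, y = event[2:].split("^Y=")          # strip the leading "X="
        if w in needed:
            C[(w, y)] = int(n)

    # Pass 2 over the test set: classify with the counters now in memory.
    predictions = {}
    for _id, words in test_examples:
        def log_pr(y):                         # log Pr(y, x1..xd); class prior omitted
            return sum(math.log((C[(w, y)] + 1.0) / (total_per_class[y] + vocab_size))
                       for w in words)
        predictions[_id] = max(classes, key=log_pr)
    return predictions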

Page 15

Can we do better?

Test data:
  id1  w1,1 w1,2 w1,3 …. w1,k1
  id2  w2,1 w2,2 w2,3 ….
  id3  w3,1 w3,2 ….
  id4  w4,1 w4,2 …
  id5  w5,1 w5,2 …
  ...

Event counts:
  X=w1^Y=sports      5245
  X=w1^Y=worldNews   1054
  X=..               2120
  X=w2^Y=…           373
  X=…                …

What we'd like:
  id1  w1,1 w1,2 w1,3 …. w1,k1   C[X=w1,1^Y=sports]=5245, C[X=w1,1^Y=..], C[X=w1,2^…], …
  id2  w2,1 w2,2 w2,3 ….         C[X=w2,1^Y=….]=1054, …, C[X=w2,k2^…]
  id3  w3,1 w3,2 ….              C[X=w3,1^Y=….]=…
  id4  w4,1 w4,2 …               …

15

Page 16

Can we do better?

Event counts:
  X=w1^Y=sports      5245
  X=w1^Y=worldNews   1054
  X=..               2120
  X=w2^Y=…           373
  X=…                …

Step 1: group the counters by word w.
How: stream and sort:
• for each C[X=w^Y=y]=n, print "w C[Y=y]=n"
• sort, and build a list of the values associated with each key w
Like an inverted index.

Result, one record per word:
  w          Counts associated with W
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=564
  …          …
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=4464

16
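Step 1 as a stream-and-sort pipeline, sketched in Python (illustrative; the course does this with small Java programs piped through Unix sort):

import sys
from itertools import groupby

def emit_by_word(counter_lines):
    # input lines like "X=w^Y=y<TAB>n"; output lines like "w<TAB>C[Y=y]=n"
    for line in counter_lines:
        event, n = line.rstrip("\n").split("\t")
        w, y = event[2:].split("^Y=")
        print(f"{w}\tC[Y={y}]={n}")

def collect_records(sorted_lines):
    # after sort, all messages for a word are adjacent: build one record per word
    pairs = (line.rstrip("\n").split("\t") for line in sorted_lines)
    for w, group in groupby(pairs, key=lambda p: p[0]):
        print(w + "\t" + ",".join(c for _, c in group))

if __name__ == "__main__":
    stage = sys.argv[1] if len(sys.argv) > 1 else "emit"
    (emit_by_word if stage == "emit" else collect_records)(sys.stdin)

Pipeline shape: python group_by_word.py emit < eventCounts.dat | sort | python group_by_word.py collect > words.dat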

Page 17

If these records were in a key-value DB we would know what to do….

Test data:
  id1  w1,1 w1,2 w1,3 …. w1,k1
  id2  w2,1 w2,2 w2,3 ….
  id3  w3,1 w3,2 ….
  id4  w4,1 w4,2 …
  id5  w5,1 w5,2 …
  ...

Record of all event counts for each word:
  w          Counts associated with W
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=…
  …          …
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=…

Step 2 (classification logic): stream through the test data and, for each test case
  idi  wi,1 wi,2 wi,3 …. wi,ki
request the event counters needed to classify idi from the event-count DB, then classify using the answers.

17

Page 18

Is there a stream-and-sort analog of this request-and-answer pattern?

(Same test data, word-indexed event-count records, and Step 2 classification logic as the previous slide.)

18

Page 19

Recall: Stream and Sort Counting: sort messages so the recipient can stream through them

[Figure: Machine A runs the counting logic over the examples (example 1, example 2, example 3, …) and emits messages of the form "C[x] += D"; Machine B sorts the messages; Machine C streams through the sorted messages (C[x1] += D1, C[x1] += D2, …) and runs the logic to combine counter updates.]

19

Page 20

Is there a stream-and-sort analog of this request-and-answer pattern?

(Test data and word-indexed event-count records as before.)

The classification logic emits one request per test word:
  W1,1  counters to id1
  W1,2  counters to id1
  …
  Wi,j  counters to idi

20

Page 21

Is there a stream-and-sort analog of this request-and-answer pattern?

Test data (now concrete):
  id1  found an aardvark in zynga's farmville today!
  id2  …
  id3  …
  id4  …
  id5  …

(Word-indexed event-count records as before.)

The classification logic emits the requests:
  found     ctrs to id1
  aardvark  ctrs to id1
  …
  today     ctrs to id1

21

Page 22

Is there a stream-and-sort analog of this request-and-answer pattern?

(Test data and word-indexed event-count records as before.)

The classification logic emits the requests, prefixed with "~":
  found     ~ctrs to id1
  aardvark  ~ctrs to id1
  …
  today     ~ctrs to id1

~ is the last printable ASCII character, so with

  % export LC_COLLATE=C

it sorts after anything else with Unix sort: within each word's group of lines, the counter record comes first and the requests for that word follow it.

22

Page 23

Is there a stream-and-sort analog of this request-and-answer pattern?

Test data:
  id1  found an aardvark in zynga's farmville today!
  id2  …
  id3  …
  …

Counter records (the record of all event counts for each word, as before) plus the requests:
  found     ~ctr to id1
  aardvark  ~ctr to id2
  …
  today     ~ctr to idi

Combine and sort.

23

Page 24

A stream-and-sort analog of the request-and-answer pattern…

The counter records and the requests (found ~ctr to id1, aardvark ~ctr to id1, …, today ~ctr to id1) are combined and sorted. The result interleaves each word's counter record with the requests for that word:

  w          Counts / requests
  aardvark   C[w^Y=sports]=2
  aardvark   ~ctr to id1
  agent      C[w^Y=sports]=…
  agent      ~ctr to id345
  agent      ~ctr to id9854
  agent      ~ctr to id34742
  …
  zynga      C[…]
  zynga      ~ctr to id1

Request-handling logic runs over this stream.

24

Page 25

A stream-and-sort analog of the request-and-answer pattern…

(Combined, sorted stream of counter records and ~ctr requests, as on the previous slide.)

Request-handling logic:
  previousKey = somethingImpossible
  for each (key, val) in input:
    if key == previousKey:
      Answer(recordForPrevKey, val)
    else:
      previousKey = key
      recordForPrevKey = val

  define Answer(record, request):
    find id where request = "~ctr to id"
    print "id ~ctr for request is record"

25
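The request-handling logic above, transcribed into runnable Python (the tab-separated line format is an assumption; the key point is that, thanks to LC_COLLATE=C and the ~ prefix, each word's counter record arrives before the requests for it):

import sys

def answer(key, record, request):
    # request looks like "~ctr to id1"; echo the record back to that id
    req_id = request.split()[-1]
    print(f"{req_id}\t~ctr for {key} is {record}")

def handle_requests(lines):
    previous_key, record_for_prev_key = None, None   # "somethingImpossible"
    for line in lines:
        key, val = line.rstrip("\n").split("\t", 1)
        if key == previous_key:
            answer(key, record_for_prev_key, val)            # val is a request
        else:
            previous_key, record_for_prev_key = key, val     # val is the counter record

if __name__ == "__main__":
    handle_requests(sys.stdin)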

Page 26

A stream-and-sort analog of the request-and-answer pattern…

(Same combined-and-sorted stream and request-handling logic as the previous slide.)

Output:
  id1  ~ctr for aardvark is C[w^Y=sports]=2
  …
  id1  ~ctr for zynga is ….
  …

26

Page 27

A stream-and-sort analog of the request-and-answer pattern…

The request-handling output:
  id1  ~ctr for aardvark is C[w^Y=sports]=2
  …
  id1  ~ctr for zynga is ….
  …

is then combined and sorted with the test data:
  id1  found an aardvark in zynga's farmville today!
  id2  …
  id3  …
  …

Combine and sort ????

27

Page 28

What we'd wanted:
  id1  w1,1 w1,2 w1,3 …. w1,k1   C[X=w1,1^Y=sports]=5245, C[X=w1,1^Y=..], C[X=w1,2^…], …
  id2  w2,1 w2,2 w2,3 ….         C[X=w2,1^Y=….]=1054, …, C[X=w2,k2^…]
  id3  w3,1 w3,2 ….              C[X=w3,1^Y=….]=…
  id4  w4,1 w4,2 …               …

What we ended up with … and it's good enough!
  Key   Value
  id1   found aardvark zynga farmville today
        ~ctr for aardvark is C[w^Y=sports]=2
        ~ctr for found is C[w^Y=sports]=1027, C[w^Y=worldNews]=564
        …
  id2   w2,1 w2,2 w2,3 ….
        ~ctr for w2,1 is …
  …     …

28

Page 29

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests

train.dat:
  id1  w1,1 w1,2 w1,3 …. w1,k1
  id2  w2,1 w2,2 w2,3 ….
  id3  w3,1 w3,2 ….
  id4  w4,1 w4,2 …
  id5  w5,1 w5,2 …
  ...

counts.dat:
  X=w1^Y=sports      5245
  X=w1^Y=worldNews   1054
  X=..               2120
  X=w2^Y=…           373
  X=…                …

29

Page 30

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests

words.dat:
  w          Counts associated with W
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=564
  …          …
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=4464

30

Page 31

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests

Input to the sort (the requests from requestWordCounts, concatenated with words.dat) looks like this:
  found     ~ctr to id1
  aardvark  ~ctr to id2
  …
  today     ~ctr to idi

  w          Counts
  aardvark   C[w^Y=sports]=2
  agent      …
  …
  zynga      …

Output of the sort (the input to answerWordCountRequests) looks like this:
  w          Counts / requests
  aardvark   C[w^Y=sports]=2
  aardvark   ~ctr to id1
  agent      C[w^Y=sports]=…
  agent      ~ctr to id345
  agent      ~ctr to id9854
  agent      ~ctr to id34742
  …

31

Page 32

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests

Output of answerWordCountRequests looks like this:
  id1  ~ctr for aardvark is C[w^Y=sports]=2
  …
  id1  ~ctr for zynga is ….
  …

test.dat:
  id1  found an aardvark in zynga's farmville today!
  id2  …
  id3  …
  …

32

Page 33

Implementation summary

java CountForNB train.dat … > eventCounts.dat
java CountsByWord eventCounts.dat | sort | java CollectRecords > words.dat

java requestWordCounts test.dat | cat - words.dat | sort | java answerWordCountRequests | cat - test.dat | sort | testNBUsingRequests

Input to testNBUsingRequests looks like this:
  Key   Value
  id1   found aardvark zynga farmville today
        ~ctr for aardvark is C[w^Y=sports]=2
        ~ctr for found is C[w^Y=sports]=1027, C[w^Y=worldNews]=564
        …
  id2   w2,1 w2,2 w2,3 ….
        ~ctr for w2,1 is …
  …     …

33

Page 34

ABSTRACTIONS FOR MAP-REDUCE

34

Page 35

Abstractions On Top Of Map-Reduce
• Some obvious streaming processes:
  – for each row in a table
    • Transform it and output the result
    • Decide if you want to keep it with some boolean test, and copy out only the ones that pass the test

Example: stem words in a stream of word-count pairs:
  ("aardvarks", 1) → ("aardvark", 1)

Proposed syntax:
  table2 = MAP table1 TO λ row : f(row)
  where f(row) → row'

Example: apply stopwords:
  ("aardvark", 1) → ("aardvark", 1)
  ("the", 1) → deleted

Proposed syntax:
  table2 = FILTER table1 BY λ row : f(row)
  where f(row) → {true, false}

35

Page 36

Abstractions On Top Of Map-Reduce
• A non-obvious(?) streaming process:
  – for each row in a table
    • Transform it to a list of items
    • Splice all the lists together to get the output table ("flatten")

Example: tokenizing a line:
  "I found an aardvark" → ["i", "found", "an", "aardvark"]
  "We love zymurgy"     → ["we", "love", "zymurgy"]
  …but the final table is one word per row:
  "i"
  "found"
  "an"
  "aardvark"
  "we"
  "love"
  …

Proposed syntax:
  table2 = FLATMAP table1 TO λ row : f(row)
  where f(row) → list of rows

36
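Rough Python analogues of the proposed MAP, FILTER, and FLATMAP operators (the names, lambdas, and crude stemming rule are only illustrations of the abstraction, not a real API):

STOPWORDS = {"the", "an", "a", "i"}

def MAP(table, f):        # f(row) -> row'
    return [f(row) for row in table]

def FILTER(table, f):     # f(row) -> True/False
    return [row for row in table if f(row)]

def FLATMAP(table, f):    # f(row) -> list of rows; splice the lists together
    return [out for row in table for out in f(row)]

pairs   = [("aardvarks", 1), ("the", 1), ("found", 1)]
stemmed = MAP(pairs, lambda wc: (wc[0].rstrip("s"), wc[1]))      # crude "stemming"
kept    = FILTER(stemmed, lambda wc: wc[0] not in STOPWORDS)      # drop stopwords
words   = FLATMAP(["I found an aardvark", "We love zymurgy"],
                  lambda line: line.lower().split())              # tokenize + flatten
print(kept)    # [('aardvark', 1), ('found', 1)]
print(words)   # ['i', 'found', 'an', 'aardvark', 'we', 'love', 'zymurgy']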

Page 37

NB Test Step

Event counts:
  X=w1^Y=sports      5245
  X=w1^Y=worldNews   1054
  X=..               2120
  X=w2^Y=…           373
  …

How: stream and sort:
• for each C[X=w^Y=y]=n, print "w C[Y=y]=n"
• sort, and build a list of the values associated with each key w
Like an inverted index.

  w          Counts associated with W
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=564
  …          …
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=4464

37

Page 38

NB Test Step

(Event counts and the word-indexed records, as on the previous slide.)

The general case: we're taking rows from a table
• in a particular format (event, count),
applying a function to get a new value
• the word for the event,
and grouping the rows of the table by this new value.
→ A grouping operation: a special case of a map-reduce.

Proposed syntax:
  GROUP table BY λ row : f(row)
  where f(row) → field

(Could define f via a function, a field of a defined record structure, …)

38

Page 39

NB Test Step

(Same grouping view and proposed GROUP … BY syntax as the previous slide.)

Aside: you guys know how to implement this, right? (A sketch follows below.)

1. Output pairs (f(row), row) with a map/streaming process.
2. Sort the pairs by key, which is f(row).
3. Reduce: aggregate by appending together all the values associated with the same key.

39
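The three steps above, as a small Python sketch (an in-memory sort stands in for the distributed shuffle-sort):

from itertools import groupby

def GROUP_BY(table, f):
    keyed = [(f(row), row) for row in table]               # 1. output (f(row), row) pairs
    keyed.sort(key=lambda kv: kv[0])                        # 2. sort the pairs by key
    return {k: [row for _, row in grp]                      # 3. append values per key
            for k, grp in groupby(keyed, key=lambda kv: kv[0])}

# group (event, count) rows by the word inside the event, as in the NB test step
events = [("X=aardvark^Y=sports", 2), ("X=agent^Y=sports", 1027),
          ("X=agent^Y=worldNews", 564), ("X=zynga^Y=sports", 21)]
word_of = lambda row: row[0].split("^")[0][2:]              # "X=agent^Y=sports" -> "agent"
print(GROUP_BY(events, word_of))
# {'aardvark': [...one row...], 'agent': [...two rows...], 'zynga': [...one row...]}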

Page 40

Abstractions On Top Of Map-Reduce
• And another example from the Naïve Bayes test program…

40

Page 41

Request-and-answer

(Test data and word-indexed event-count records as before.)

Step 2 (classification logic): stream through the test data and, for each test case
  idi  wi,1 wi,2 wi,3 …. wi,ki
request the event counters needed to classify idi from the event-count DB, then classify using the answers.

41

Page 42

Request-and-answer
• Break it down into stages:
  – Generate the data being requested (indexed by key, here a word)
    • e.g. with GROUP … BY
  – Generate the requests as (key, requestor) pairs
    • e.g. with FLATMAP … TO
  – Join these two tables by key
    • Join, conceptually: (1) take the cross-product and (2) filter out pairs with different values for the keys
    • Join, as implemented: concatenate the two tables of key-value pairs and reduce them together
  – Postprocess the joined result

42

Page 43

Counter records (from GROUP … BY):
  w          Counters
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=564
  …          …
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=4464

Requests (from FLATMAP … TO):
  w          Request
  found      ~ctr to id1
  aardvark   ~ctr to id1
  …
  zynga      ~ctr to id1
  …          ~ctr to id2

Joined by w:
  w          Counters            Requests
  aardvark   C[w^Y=sports]=2     ~ctr to id1
  agent      C[w^Y=sports]=…     ~ctr to id345
  agent      C[w^Y=sports]=…     ~ctr to id9854
  agent      C[w^Y=sports]=…     ~ctr to id345
  …          C[w^Y=sports]=…     ~ctr to id34742
  zynga      C[…]                ~ctr to id1
  zynga      C[…]                …

43

Page 44

Counter records:
  w          Counters
  aardvark   C[w^Y=sports]=2
  agent      C[w^Y=sports]=1027, C[w^Y=worldNews]=564
  …          …
  zynga      C[w^Y=sports]=21, C[w^Y=worldNews]=4464

Requests:
  w          Request
  found      id1
  aardvark   id1
  …
  zynga      id1
  …          id2

Joined by w:
  w          Counters            Requests
  aardvark   C[w^Y=sports]=2     id1
  agent      C[w^Y=sports]=…     id345
  agent      C[w^Y=sports]=…     id9854
  agent      C[w^Y=sports]=…     id345
  …          C[w^Y=sports]=…     id34742
  zynga      C[…]                id1
  zynga      C[…]                …

Proposed syntax:
  JOIN table1 BY λ row : f(row), table2 BY λ row : g(row)

Examples:
  JOIN wordInDoc BY word, wordCounters BY word
    (if word(row) is defined correctly)
  JOIN wordInDoc BY lambda (word, docid): word,
       wordCounters BY lambda (word, counters): word
    (using Python syntax for functions)

44
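A rough Python analogue of the proposed JOIN operator, in the same spirit as the MAP / FILTER / FLATMAP / GROUP sketches earlier. It uses the conceptual cross-product-then-filter definition from the "Request-and-answer" slide; at scale the join would instead be implemented by concatenating the two keyed tables and reducing them together.

def JOIN(table1, f, table2, g):
    # conceptually: cross-product, then keep only pairs whose keys match
    return [(row1, row2) for row1 in table1 for row2 in table2 if f(row1) == g(row2)]

wordInDoc    = [("found", "id1"), ("aardvark", "id1"), ("zynga", "id1")]
wordCounters = [("found", "C[w^Y=sports]=1027"), ("aardvark", "C[w^Y=sports]=2")]

joined = JOIN(wordInDoc,    lambda row: row[0],      # JOIN wordInDoc BY word,
              wordCounters, lambda row: row[0])      #      wordCounters BY word
print(joined)
# [(('found', 'id1'), ('found', 'C[w^Y=sports]=1027')),
#  (('aardvark', 'id1'), ('aardvark', 'C[w^Y=sports]=2'))]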

Page 45

Abstract Implementation: [TF]IDF (1/2)

data = pairs (docid, term) where term is a word that appears in the document with id docid

Operators:
• DISTINCT, MAP, JOIN
• GROUP BY …. [RETAINING …] REDUCING TO … (the REDUCING TO part is a reduce step)

docFreq = DISTINCT data
  | GROUP BY λ(docid,term): term REDUCING TO count            /* (term, df) */

docIds  = MAP data BY λ(docid,term): docid | DISTINCT
numDocs = GROUP docIds BY λ docid: 1 REDUCING TO count        /* (1, numDocs) */

dataPlusDF = JOIN data BY λ(docid,term): term, docFreq BY λ(term,df): term
  | MAP λ((docid,term),(term,df)): (docid, term, df)          /* (docid, term, document-freq) */

unnormalizedDocVecs = JOIN dataPlusDF BY λ row: 1, numDocs BY λ row: 1
  | MAP λ((docid,term,df),(dummy,numDocs)): (docid, term, log(numDocs/df))
                                      /* (docid, term, weight-before-normalizing) : u */

Examples of the intermediate tables:

  data:                    docId  term
                           d123   found
                           d123   aardvark

  grouped for docFreq:     key       value
                           found     (d123,found), (d134,found), …
                           aardvark  (d123,aardvark), …

  numDocs:                 key  value
                           1    12451

  docFreq:                 key       value
                           found     2456
                           aardvark  7

45

question – how many reducers should I use here?

Page 46

Abstract Implementation: [TF]IDF (1/2, continued)

(Same pipeline and intermediate tables as the previous slide; the annotation "question – how many reducers should I use here?" is repeated on this slide.)

46

Page 47

Abstract Implementation: TFIDF (2/2)

normalizers = GROUP unnormalizedDocVecs BY λ(docid,term,w): docid
                RETAINING λ(docid,term,w): w²
                REDUCING TO sum                     /* (docid, sum-of-square-weights) */

docVec = JOIN unnormalizedDocVecs BY λ(docid,term,w): docid,
              normalizers BY λ(docid,norm): docid
  | MAP λ((docid,term,w),(docid,norm)): (docid, term, w/sqrt(norm))
                                                    /* (docid, term, weight) */

Examples of the intermediate tables:

  unnormalizedDocVecs:     docId  term      w
                           d1234  found     1.542
                           d1234  aardvark  13.23

  grouped, then reduced:   key    value
                           d1234  (d1234,found,1.542), (d1234,aardvark,13.23), …   37.234
                           d3214  ….                                               29.654

  normalizers:             docId  norm
                           d1234  37.234
                           d3214  29.654

47
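The same dataflow written out as plain Python over small in-memory lists (a sketch of what the abstract pipeline computes, not of how Hadoop would run it):

import math
from collections import defaultdict

# (docid, term) pairs: term is a word appearing in the document with id docid
data = [("d123", "found"), ("d123", "aardvark"),
        ("d134", "found"), ("d134", "zymurgy")]

distinct = set(data)
doc_freq = defaultdict(int)                # GROUP BY term REDUCING TO count -> (term, df)
for docid, term in distinct:
    doc_freq[term] += 1
num_docs = len({docid for docid, _ in distinct})                      # (1, numDocs)

unnormalized = [(docid, term, math.log(num_docs / doc_freq[term]))    # JOIN + MAP
                for docid, term in distinct]                          # (docid, term, u)

normalizers = defaultdict(float)           # GROUP BY docid RETAINING w^2 REDUCING TO sum
for docid, term, w in unnormalized:
    normalizers[docid] += w * w

doc_vec = [(docid, term, w / math.sqrt(normalizers[docid]))           # JOIN + MAP
           for docid, term, w in unnormalized
           if normalizers[docid] > 0]      # guard: skip docs whose weights are all zero
print(doc_vec)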

Page 48

Two ways to join

• Reduce-side join
• Map-side join

48

Page 49

Two ways to join
• Reduce-side join for A, B

  A:   term      df
       found     2456
       aardvark  7
       …

  B:   term      docId
       aardvark  d15
       …         ...
       found     d7
       found     d23
       found     …
       …

Concat and sort, then do the join:
  A                B
  (aardvark, 7)    (aardvark, d15)
  …                ...
  (found, 2456)    (found, d7)
  (found, 2456)    (found, d23)
  …                …

49

Page 50

Two ways to join
• Reduce-side join for A, B

Tag each tuple with the relation it came from:
  A:   term         df          B:   term         docId
       found A      2456             aardvark B   d15
       aardvark A   7                …            ...
       …                             found B      d7
                                     found B      d23
                                     found B      …
                                     …

Concat and sort, then do the join:
  A                B
  (aardvark, 7)    (aardvark, d15)
  …                ...
  (found, 2456)    (found, d7)
  (found, 2456)    (found, d23)
  …                …

Tricky bit: we need to sort by the first two values (aardvark, AB), because we want the DF's to come first; but all tuples with key "aardvark" should still go to the same worker.

50

Page 51

Two ways to join
• Reduce-side join for A, B

(Same tagged relations and concat-and-sort picture as the previous slide.)

Tricky bit: we need to sort by the first two values (aardvark, AB), because we want the DF's to come first; but all tuples with key "aardvark" should still go to the same worker. In Hadoop:
  – custom sort (secondary sort key): a Writable with your own Comparator
  – custom Partitioner (specified for the job like the Mapper, Reducer, …)

(A stream-and-sort sketch of the reduce-side join follows below.)
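A stream-and-sort sketch of the reduce-side join in Python (the A/B tags and the sort on (term, tag) mimic the secondary-sort-plus-custom-Partitioner trick described above):

from itertools import groupby

A = [("found", 2456), ("aardvark", 7)]
B = [("aardvark", "d15"), ("found", "d7"), ("found", "d23")]

tagged = [(term, "A", df) for term, df in A] + [(term, "B", doc) for term, doc in B]
tagged.sort(key=lambda t: (t[0], t[1]))        # "A" < "B": each term's df row leads its group

joined = []
for term, group in groupby(tagged, key=lambda t: t[0]):
    df = None
    for _, tag, val in group:                  # the first row of the group carries the df
        if tag == "A":
            df = val
        else:                                  # every B row joins with the remembered df
            joined.append((term, df, val))
print(joined)   # [('aardvark', 7, 'd15'), ('found', 2456, 'd7'), ('found', 2456, 'd23')]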

Page 52

Two ways to join
• Map-side join
  – write the smaller relation out to disk
  – send it to each Map worker
    • DistributedCache
  – when you initialize each Mapper, load in the small relation
    • Configure(…) is called at initialization time
  – map through the larger relation and do the join
  – faster, but requires one relation to fit in memory (see the sketch below)

52
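A minimal Python sketch of the map-side join idea (illustrative; in Hadoop the small relation would arrive via the DistributedCache and be loaded when the Mapper is initialized):

def map_side_join(small_A, large_B_rows):
    # "load into mapper": the small relation as an in-memory dict, term -> df
    df_of = dict(small_A)
    # "join as you map": one pass over the big relation, no shuffle needed
    for term, doc_id in large_B_rows:
        if term in df_of:
            yield (term, df_of[term], doc_id)

A = [("found", 2456), ("aardvark", 7)]                        # small: fits in memory
B = [("aardvark", "d15"), ("found", "d7"), ("found", "d23")]  # large: streamed
print(list(map_side_join(A, B)))
# [('aardvark', 7, 'd15'), ('found', 2456, 'd7'), ('found', 2456, 'd23')]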

Page 53

Two ways to join
• Map-side join for A (small) and B (large)

  A (load into each mapper):          term      df
                                      found     2456
                                      aardvark  7

  B (streamed through the mappers):   term      docId
                                      aardvark  d15
                                      …         ...
                                      found     d7
                                      found     d23
                                      found     …
                                      …

Join as you map:
  A                B
  (aardvark, 7)    (aardvark, d15)
  …                ...
  (found, 2456)    (found, d7)
  (found, 2456)    (found, d23)
  …                …

53

Page 54

Two ways to join
• Map-side join for A (small) and B (large)

A is duplicated and loaded into every mapper; B is split into B1, B2, … across the mappers:

  A:   term      df       B1:  term      docId      B2:  term   docId
       found     2456           aardvark  d15             found  d7
       aardvark  7              …         ...             found  d23
                                                          found  …
                                                          …

Join as you map:
  A                B
  (aardvark, 7)    (aardvark, d15)
  …                ...
  (found, 2456)    (found, d7)
  (found, 2456)    (found, d23)
  …                …

54

Page 55

PIG: A WORKFLOW LANGUAGE

55

Page 56

PIG: word count example

A PIG program is a bunch of assignments where every LHS is a relation. No loops, conditionals, etc. are allowed.

56

Page 57

TOKENIZE – a built-in function.

FLATTEN – a special keyword, which applies to the next step in the process, so the output is a stream of words without document boundaries.

Built-in regex matching ….

57

Page 58

Group by … foreach generate count(…) will be optimized into a single map-reduce.

Group produces a stream of bags of identical words… bags, tuples, and dictionaries are primitive types.

58