
CS347: MapReduce


Motivation for Map-Reduce

• Distribution makes simple computations complex
  – Communication
  – Load balancing
  – Fault tolerance
  – …

• What if we could write “simple” programs that were automatically parallelized?

Motivation for Map-Reduce

• Recall one of our sort strategies:

[Figure: input fragments R1, R2, R3 are processed and partitioned on split keys k0, k1; each resulting partition R'1, R'2, R'3 is locally sorted; additional processing combines them into the final result.]

Another example: Asymmetric fragment + replicate join

[Figure: asymmetric fragment-and-replicate join — R (fragments Ra, Rb) is processed and partitioned into R1, R2, R3; S (fragments Sa, Sb) is replicated to every site; each site runs a local join; a union (additional processing) of the per-site results gives the final result.]

From our point of view…

• What if we didn’t have to think too hard about the number of sites or fragments?

• MapReduce goal: a library that hides the distribution from the developer, who can then focus on the fundamental “nuggets” of their computation


Building Text Index - Part I

[Figure: documents 1–3 (words such as “rat”, “dog”, “cat”) stream in as pages from disk; loading and tokenizing emit (word, doc-id) pairs, e.g., (rat, 1), (dog, 1), (dog, 2), (cat, 2), (rat, 3), (dog, 3); sorting and flushing write sorted intermediate runs, e.g., (cat, 2), (dog, 1), (dog, 2), (dog, 3), (rat, 1), (rat, 3).]

(Building a text index was the original Map-Reduce application.)

Building Text Index - Part II

[Figure: the sorted intermediate runs, e.g., (cat, 2), (dog, 1), (dog, 2), (dog, 3), (rat, 1), (rat, 3) and (ant, 5), (cat, 4), (dog, 4), (dog, 5), (eel, 6), are merged into the final index: (ant: 5), (cat: 2,4), (dog: 1,2,3,4,5), (eel: 6), (rat: 1,3).]

Generalizing: Map-Reduce

[Figure: the same text-indexing pipeline as before; the per-document tokenize/sort/flush phase that produces the intermediate runs is labeled “Map”.]

Generalizing: Map-Reduce

[Figure: the same merge phase as before; merging the intermediate runs into the final index is labeled “Reduce”.]

Map Reduce

• Input: R = {r1, r2, …, rn}, functions M, R
  – M(ri) → { [k1, v1], [k2, v2], … }
  – R(ki, valSet) → [ki, valSet′]

• Let S = { [k, v] | [k, v] ∈ M(r) for some r ∈ R }   (S is a bag)

• Let K = { k | [k, v] ∈ S for some v }

• Let G(k) = { v | [k, v] ∈ S }   (G(k) is a bag)

• Output = { [k, T] | k ∈ K, T = R(k, G(k)) }
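To pin these definitions down, here is a minimal single-machine Python sketch of the same semantics (the names map_reduce, M_fn, R_fn are ours, not part of any real library); it illustrates the model, not the distributed implementation:

    from collections import defaultdict

    def map_reduce(R_input, M_fn, R_fn):
        # S = bag of all [k, v] pairs emitted by M over the input records
        S = [kv for r in R_input for kv in M_fn(r)]
        # G(k) = bag of values v such that [k, v] is in S
        G = defaultdict(list)
        for k, v in S:
            G[k].append(v)
        # Output = { [k, T] | k in K, T = R(k, G(k)) }
        return {k: R_fn(k, vals) for k, vals in G.items()}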


References

• MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat. Available at http://labs.google.com/papers/mapreduce-osdi04.pdf


Example: Counting Word Occurrences

• map(String doc, String value):
    // doc is document name
    // value is document content
    for each word w in value:
      EmitIntermediate(w, “1”);

• Example:
  – map(doc, “cat dog cat bat dog”) emits
    [cat 1], [dog 1], [cat 1], [bat 1], [dog 1]


Example: Counting Word Occurrences

• reduce(String key, Iterator values):
    // key is a word
    // values is a list of counts
    int result = 0;
    for each v in values:
      result += ParseInt(v);
    Emit(AsString(result));

• Example:
  – reduce(“dog”, “1 1 1 1”) emits “4”, which becomes (“dog”, 4)
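The slides' pseudocode, turned into a runnable Python sketch on top of the map_reduce helper above (the function names are ours; this is not Hadoop's or Google's API):

    def wc_map(doc):
        name, content = doc                       # doc = (document name, contents)
        return [(w, 1) for w in content.split()]  # EmitIntermediate(w, 1)

    def wc_reduce(word, counts):
        return sum(counts)                        # Emit(result)

    docs = [("doc1", "cat dog cat bat dog")]
    print(map_reduce(docs, wc_map, wc_reduce))
    # {'cat': 2, 'dog': 2, 'bat': 1}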

Mappers / Shuffle / Reduce

[Figure, animated across several slides: the source data is divided into splits; worker processes pick up the splits and run the map tasks; a shuffle then redistributes the intermediate data among the workers; finally the workers run the reduce tasks.]

Another way to think about it:

• Mapper: (some query)

• Shuffle: GROUP BY

• Reducer: SELECT Aggregate()
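To make the analogy concrete, here is a small sketch using Python's built-in sqlite3 module (the table and the data are made up): the mapper's output corresponds to the (k, v) rows, the shuffle to GROUP BY k, and the reducer to the aggregate in the SELECT.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE pairs (k TEXT, v INTEGER)")
    conn.executemany("INSERT INTO pairs VALUES (?, ?)",
                     [("cat", 1), ("dog", 1), ("cat", 1), ("bat", 1), ("dog", 1)])

    # Shuffle ~ GROUP BY k; Reducer ~ SELECT k, SUM(v)
    for row in conn.execute("SELECT k, SUM(v) FROM pairs GROUP BY k ORDER BY k"):
        print(row)   # ('bat', 1), ('cat', 2), ('dog', 2)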


Doesn’t have to be relational

• Mapper: Parse data into K,V pairs

• Shuffle: Repartition by K

• Reducer: Transform the V’s for a K into a Vfinal


Process model

[Figures, two slides: the Map-Reduce process model.]

• Can vary the number of mappers to tune performance
• Reduce tasks are bounded by the number of reduce shards

Implementation Issues

• Combine function

• File system

• Partition of input, keys

• Failures

• Backup tasks

• Ordering of results


Combine Function

• Combine is like a local reduce applied before distribution:

[Figure: a map worker that emitted [cat 1], [cat 1], [cat 1], …, [dog 1], [dog 1], … sends only the locally combined pairs [cat 3], …, [dog 2], … on to the reduce workers.]
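A sketch of a combiner in Python, reusing the word-count functions above (a hedged illustration; this only works when the reduce function can be applied to partial results, as a sum can):

    from collections import defaultdict

    def combine(local_map_output, reduce_fn):
        # Pre-aggregate this worker's pairs by key before anything is shuffled.
        groups = defaultdict(list)
        for k, v in local_map_output:
            groups[k].append(v)
        return [(k, reduce_fn(k, vs)) for k, vs in groups.items()]

    # [('cat', 1), ('cat', 1), ('cat', 1), ('dog', 1), ('dog', 1)]
    # shrinks to [('cat', 3), ('dog', 2)] before it crosses the network.
    print(combine([("cat", 1)] * 3 + [("dog", 1)] * 2, wc_reduce))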

Data flow

• A worker must be able to access any part of the input file, so the input lives on a distributed file system
• A reduce worker must be able to access the local disks of the map workers
• Any worker must be able to write its part of the answer; the answer is left as a distributed file
• A high-throughput network is essential

Partition of input, keys

• How many workers, partitions of input file?

[Figure: an input file divided into numbered splits, assigned to workers.]

• How many splits? How many workers?
  – Best to have many splits per worker: improves load balance, and if a worker fails it is easier to spread its tasks over the remaining workers

Should workers be assigned to splits “near” them?

Similar questions for reduce workers

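For the reduce side, a common way to partition keys among reduce shards is a deterministic hash (a sketch, not the exact function any particular system uses):

    import zlib

    def shard_of(key, num_reduce_shards):
        # Must be deterministic so every map worker sends a given key
        # to the same reduce shard.
        return zlib.crc32(str(key).encode("utf-8")) % num_reduce_shards

    print(shard_of("dog", 4))   # same answer on every worker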

What takes time?

• Reading input data
• Mapping data
• Shuffling data
• Reducing data

• Map and reduce are separate phases
  – Latency determined by the slowest task
  – Reduce shard skew can increase latency

• Map and shuffle can be overlapped
  – But if there is a lot of intermediate data, the shuffle may be slow


Failures

• The distributed implementation should produce the same output as a non-faulty sequential execution of the program.

• General strategy: Master detects worker failures, and has work re-done by another worker.

[Figure: the master periodically asks the worker processing split j “ok?”; when the worker stops responding, the master tells another worker to “redo j”.]

Backup Tasks

• A straggler is a machine that takes unusually long to finish its work (e.g., because of a bad disk).
• A straggler can delay final completion.
• When the computation is close to finishing, the master schedules backup executions of the remaining tasks.
  – Must be able to eliminate redundant results


Ordering of Results

• Final result (at each node) is in key order:
  [k1, T1] [k2, T2] [k3, T3] [k4, T4]

• Also in key order:
  [k1, v1] [k3, v3]


Example: Sorting Records

[Figure: map workers W1–W3 hold records such as {5 e, 3 c, 6 f}, {2 b, 1 a, 6 f*}, {8 h, 4 d, 9 i, 7 g}; after the shuffle, reduce workers W5 and W6 each produce a key-ordered output: {1 a, 3 c, 5 e, 7 g, 9 i} and {2 b, 4 d, 6 f, 6 f*, 8 h}.]

• Map: extract k, output [k, record]
• Reduce: do nothing!
• Question: one or two records for k = 6?
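A single-process Python sketch of the idea (the range partitioner and its boundaries are our illustration; with range partitioning, concatenating the reduce shards in order yields a globally sorted file, while a hash partitioner would only give key order within each shard). In this sketch both k = 6 records survive, since S and G are bags.

    def sort_records(records, boundaries):
        # Map: extract the key, output [k, record]; Reduce: do nothing.
        def shard_of(k):
            for s, b in enumerate(boundaries):
                if k < b:
                    return s
            return len(boundaries)

        shards = [[] for _ in range(len(boundaries) + 1)]
        for rec in records:                       # "shuffle" by key range
            shards[shard_of(rec[0])].append((rec[0], rec))
        return [rec for shard in shards for _, rec in sorted(shard)]

    records = [(5, "e"), (3, "c"), (6, "f"), (2, "b"), (1, "a"), (6, "f*"),
               (8, "h"), (4, "d"), (9, "i"), (7, "g")]
    print(sort_records(records, boundaries=[4, 7]))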


Other Issues

• Skipping bad records

• Debugging


MR Claimed Advantages

• Model is easy to use; it hides the details of parallelization and fault recovery

• Many problems are expressible in the MR framework
• Scales to thousands of machines


MR Possible Disadvantages

• The 1-input, 2-stage data flow is rigid and hard to adapt to other scenarios

• Custom code needs to be written even for the most common operations, e.g., projection and filtering

• The opaque nature of the map and reduce functions impedes optimization


Hadoop

• Open-source Map-Reduce system
• Also, a toolkit
  – HDFS – filesystem for Hadoop
  – HBase – database for Hadoop
• Also, an ecosystem
  – Tools

– Recipes

– Developer community


MapReduce pipelines

• Output of one MapReduce becomes input for another
  – Example (a code sketch follows this list):

• Stage 1: Translate data to canonical form

• Stage 2: Term count

• Stage 3: Sort by frequency

• Stage 4: Extract top-k
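A toy end-to-end version of this pipeline using the map_reduce helper defined earlier (the data and stage functions are ours, purely for illustration; stage 4 funnels everything to a single key, which is simple but not scalable):

    docs = [("d1", "Cat dog CAT bat dog"), ("d2", "Dog bat")]

    # Stage 1: translate data to canonical form (here: lowercase the text).
    canon = map_reduce(docs,
                       lambda d: [(d[0], d[1].lower())],
                       lambda name, texts: texts[0])

    # Stage 2: term count.
    counts = map_reduce(canon.items(),
                        lambda d: [(w, 1) for w in d[1].split()],
                        lambda w, ones: sum(ones))

    # Stage 3: sort by frequency (re-key each term by its count).
    by_freq = map_reduce(counts.items(),
                         lambda kv: [(kv[1], kv[0])],
                         lambda count, words: sorted(words))

    # Stage 4: extract top-k.
    k = 2
    top_k = map_reduce(by_freq.items(),
                       lambda kv: [("top", kv)],
                       lambda _key, pairs: sorted(pairs, reverse=True)[:k])
    print(top_k["top"])   # the k highest counts and their terms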


Make it database-y?

• Simple idea: each operator is a MapReduce

• How to do:
  – Select
  – Project
  – Group by, aggregate
  – Join


Reduce-side join

• Shuffle puts all values for the same key at the same reducer

• Mapper
  – Input: tuples from R; tuples from S
  – Output: (join value, (R|S, tuple))
• Reducer
  – Local join of all R tuples with all S tuples for that key
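A sketch of the reduce-side join on top of the map_reduce helper above (input records here are pre-tagged with the relation they came from; the tagging is our simplification of "the mapper knows which file it is reading"):

    def rsj_map(tagged):
        relation, tup, join_value = tagged                 # ("R"|"S", tuple, join key)
        return [(join_value, (relation, tup))]

    def rsj_reduce(join_value, tagged_tuples):
        r_side = [t for rel, t in tagged_tuples if rel == "R"]
        s_side = [t for rel, t in tagged_tuples if rel == "S"]
        return [(r, s) for r in r_side for s in s_side]    # local join for this key

    inputs = [("R", ("a", 1), 1), ("R", ("b", 2), 2),
              ("S", (1, "x"), 1), ("S", (1, "y"), 1)]
    print(map_reduce(inputs, rsj_map, rsj_reduce))
    # {1: [(('a', 1), (1, 'x')), (('a', 1), (1, 'y'))], 2: []}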


Map-side join

• Like a hash-join, but every mapper has a copy of the hash table

• Mapper:
  – Read in a hash table of R
  – Input: tuples of S
  – Output: tuples of S joined with tuples of R
• Reducer
  – Pass through
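A sketch of the map-side join (the in-memory hash table stands in for shipping the small relation R to every mapper, e.g., via a distributed cache; the helper names are ours):

    from collections import defaultdict

    def build_hash(r_tuples, key_of):
        table = defaultdict(list)
        for t in r_tuples:
            table[key_of(t)].append(t)
        return table

    def make_msj_mapper(r_hash, key_of_s):
        # Each S tuple is joined locally in the mapper; the reducer just passes through.
        def mapper(s_tuple):
            k = key_of_s(s_tuple)
            return [(k, (r, s_tuple)) for r in r_hash.get(k, [])]
        return mapper

    R = [("a", 1), ("b", 2)]
    S = [(1, "x"), (1, "y"), (3, "z")]
    r_hash = build_hash(R, key_of=lambda r: r[1])
    mapper = make_msj_mapper(r_hash, key_of_s=lambda s: s[0])
    print(map_reduce(S, mapper, lambda k, vs: vs))   # pass-through reducer
    # {1: [(('a', 1), (1, 'x')), (('a', 1), (1, 'y'))]}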


Comparison

• Reduce-side join shuffles all the data

• Map-side join requires one table to be small


Semi-join?

• One idea (sketched in code below):
  – MapReduce 1: extract the join keys of R; reduce-side join them with S
  – MapReduce 2: map-side join the result of MapReduce 1 with R
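A rough sketch of that two-job plan, reusing map_reduce and build_hash from the earlier sketches (all names are ours; illustration only):

    def semi_join(R, S, r_key, s_key):
        # MapReduce 1: ship only R's join keys; keep the S tuples that match one.
        def mr1_map(item):
            tag, payload = item
            if tag == "Rkey":
                return [(payload, ("Rkey", None))]
            return [(s_key(payload), ("S", payload))]

        def mr1_reduce(k, tagged):
            if any(t == "Rkey" for t, _ in tagged):
                return [s for t, s in tagged if t == "S"]
            return []

        job1_input = [("Rkey", r_key(r)) for r in R] + [("S", s) for s in S]
        matching_s = map_reduce(job1_input, mr1_map, mr1_reduce)

        # MapReduce 2: map-side join of the surviving S tuples with R.
        r_hash = build_hash(R, r_key)
        def mr2_map(kv):
            k, s_tuples = kv
            return [(k, (r, s)) for s in s_tuples for r in r_hash.get(k, [])]
        return map_reduce(matching_s.items(), mr2_map, lambda k, vs: vs)

    print(semi_join(R=[("a", 1), ("b", 2)], S=[(1, "x"), (3, "z")],
                    r_key=lambda r: r[1], s_key=lambda s: s[0]))
    # {1: [(('a', 1), (1, 'x'))], 2: [], 3: []}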


Platforms for SQL-like queries

• Pig Latin

• Hive

• MapR

• MemSQL

• …


Why not just use a DBMS?

• Many DBMSs exist and are highly optimized

• A Comparison of Approaches to Large-Scale Data Analysis. Pavlo et al., SIGMOD 2009


Why not just use a DBMS?

• One reason: loading data into a DBMS is hard

• A Comparison of Approaches to Large-Scale Data Analysis. Pavlo et al., SIGMOD 2009


Why not just use a DBMS?

• Other possible reasons:
  – MapReduce is more scalable

– MapReduce is more easily deployed

– MapReduce is more easily extended

– MapReduce is more easily optimized

– MapReduce is free (that is, Hadoop)

– I already know Java

– MapReduce is exciting and new


Data store

• Instead of HDFS, data could be stored in a database

• Example: HadoopDB/Hadapt
  – Data store is PostgreSQL
  – Allows for indexing, fast local query processing

HadoopDB: An Architectural Hybrid of MapReduce and DBMS Technologies for Analytical Workloads. Azza Abouzeid, Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz, Alex Rasin. In Proceedings of VLDB, 2009.


Batch processing?

• MapReduce materializes intermediate results
  – Map output is written to disk before the reduce starts

• What if we pipelined the data from map to reduce?
  – Reduce could quickly produce approximate answers

– MapReduce could implement continuous queries

MapReduce Online. Tyson Condie, Neil Conway, Peter Alvaro, Joe Hellerstein, Khaled Elmeleegy, Russell Sears. NSDI 2010


Not just database queries!

• Any large, partitionable computation
  – Build an inverted index
  – Image thumbnailing
  – Machine translation
  – …
  – Anything expressible in the Map/Reduce functions via a general-purpose language
