CIS 455/555: Internet and Web Systems
TRANSCRIPT
© 2021 A. Haeberlen, Z. Ives, V. Liu
University of Pennsylvania

MapReduce
October 20, 2021

Plan for today
- Google File System
- Introduction to MapReduce
  - Programming model
  - Data flow
  - Example tasks
NEXT

How Do We Get Parallelism in the Real World? Consider the US Census
- There are ~330 million people in the USA
- Suppose we are doing the census in person
- 10,000 employees, whose job is to collate census forms and to determine how many people live in each city
- How would you coordinate this task?
https://www.census.gov/programs-surveys/decennial-census/technical-documentation/questionnaires/2020.html

Basic Strategy for Canvassing in Parallel
- Send workers out in parallel
- They report back with a stack of filled-out forms
- …and then find the next zone to canvass

Basic Strategy for Canvassing in Parallel
[Figure: map of US ZIP-code zones]
https://3danim8.files.wordpress.com/2017/07/usa-zip-code-map.jpg

The second part: grouping!
As we collect forms, they are from many places… Suppose we want to count by congressional district?
1. Sequential: One person sorts everything!
2. Parallel: decompose the work into chunks and work together:
   - Divide-and-conquer sorting, e.g., merge sort
   - Bucketing, e.g., bucket sort, hashing

Mergesort in parallel (focusing on 2 workers)
[Diagram: Workers 0 and 1 sort their own data, then merge sorted runs over successive rounds until Worker 0 holds the final sorted result]
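
To make the rounds concrete, here is a minimal sketch of the two-worker case in Java (the class and method names are illustrative, not from the lecture): each worker sorts its half independently, and a final round merges the two sorted runs.

import java.util.Arrays;

// Sketch: two workers sort their halves concurrently; one worker
// then performs the final merge round.
public class TwoWorkerMergesort {
  public static int[] sort(int[] data) throws InterruptedException {
    int mid = data.length / 2;
    int[] left = Arrays.copyOfRange(data, 0, mid);
    int[] right = Arrays.copyOfRange(data, mid, data.length);

    Thread worker0 = new Thread(() -> Arrays.sort(left));   // Worker 0
    Thread worker1 = new Thread(() -> Arrays.sort(right));  // Worker 1
    worker0.start(); worker1.start();
    worker0.join(); worker1.join();

    // Final round: Worker 0 merges the two sorted runs
    int[] out = new int[data.length];
    int i = 0, j = 0, k = 0;
    while (i < left.length && j < right.length)
      out[k++] = (left[i] <= right[j]) ? left[i++] : right[j++];
    while (i < left.length) out[k++] = left[i++];
    while (j < right.length) out[k++] = right[j++];
    return out;
  }
}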

Hashing in parallel (focusing on 2 workers)
[Diagram: each item's key is hashed to decide which of the two workers it is sent to, so every item with the same key ends up at the same worker]
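
The routing rule behind the picture fits in one function; a hedged Java sketch (the function name is ours, not the lecture's):

// Hash partitioning: all records with the same key land on the same worker.
static int ownerOf(String key, int numWorkers) {
  // Mask off the sign bit rather than Math.abs (which fails on MIN_VALUE)
  return (key.hashCode() & Integer.MAX_VALUE) % numWorkers;
}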

Counting groups!
[Diagram: Workers 0 and 1 run count() on each group they hold; the five groups yield counts 4, 4, 5, 7, and 4]

A few wrinkles
- What if some of the data is bad?
  - Assign a task to everyone as they collect census forms: filter anything that doesn't pass a sanity check
- What about if some people finish far before others?
  - Break into many more tasks than we have people
  - Have a centralized coordinator (scheduler) for the remaining work!
- Can we do partial counts?

A dataflow diagram – independent of # of workers
[Diagram: filter → group → aggregate; count() runs on each group, yielding 4 orange, 4 green, 5 blue, 7 cyan, 4 gray]
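
On a single machine, the same filter → group → aggregate dataflow can be written directly; a minimal Java sketch (Java 16+ for the record syntax; the Form type and its fields are illustrative):

import java.util.List;
import java.util.Map;
import static java.util.stream.Collectors.counting;
import static java.util.stream.Collectors.groupingBy;

// A census form with a district key and a sanity-check flag
record Form(String district, boolean valid) {}

class DataflowSketch {
  // Filter bad records, group by district, count each group
  static Map<String, Long> countByDistrict(List<Form> forms) {
    return forms.stream()
        .filter(Form::valid)                    // filter
        .collect(groupingBy(Form::district,     // group
                            counting()));       // aggregate
  }
}

The point of the diagram is that nothing in this pipeline mentions the number of workers; the same three stages can be split across any number of machines.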

Summary of the Intuitions
For very particular kinds of data-collecting tasks, we have a highly parallel scheme:
- Fetch and filter the data
- Partition the data into groups, by mergesort or hashing
- Aggregate each group
More workers allow more tasks to be done in parallel, up to the maximum number of tasks that don't have a data dependency!
Let's now formalize this with a computational framework…

MapReduce
- Wouldn't it be nice if there were some system that took care of all these details for you?
  - But every task is different!
  - Or is it? The details are different (what to compute, etc.), but the data flow is often the same!
  - Maybe we can have a 'generic' solution?
- Ideally, you'd just tell the system what needs to be done
- That's the MapReduce framework.

What is MapReduce?
- A famous distributed programming model
- In many circles, considered the key building block for much of Google's data analysis
- A programming language built on it: Sawzall, http://labs.google.com/papers/sawzall.html
  - "… Sawzall has become one of the most widely used programming languages at Google. … [O]n one dedicated Workqueue cluster with 1500 Xeon CPUs, there were 32,580 Sawzall jobs launched, using an average of 220 machines each. While running those jobs, 18,636 failures occurred (application failure, network outage, system crash, etc.) that triggered rerunning some portion of the job. The jobs read a total of 3.2×10^15 bytes of data (2.8PB) and wrote 9.9×10^12 bytes (9.3TB)."
- Other similar languages: Yahoo's Pig Latin and Pig; Microsoft's Dryad
- Cloned in open source: Hadoop, http://hadoop.apache.org/core/

The MapReduce programming model
- Simple distributed functional programming primitives
- Modeled after Lisp primitives:
  map (apply function f to each item x in a collection, creating a new collection with f(x) in its place) and
  reduce (apply a function to the set of items with a common key)
- We start with:
  - A user-defined function to be applied to all data:
    map: (item_key, value) → (stack_key, value')
  - Another user-specified operation:
    reduce: (stack_key, {set of value'}) → result
  - A set of n nodes, each with data
- All nodes run map on their data, producing new data with keys
- This data is collected by key; then there is an implicit shuffle stage, and finally a reduce
- Dataflow is through temp files on GFS
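
In Java terms, the shapes of the two user-supplied functions could be sketched like this (hypothetical interfaces for illustration, not an actual framework API):

// IK/IV: input key/value; SK/SV: intermediate ("stack") key/value; R: result
interface Emitter<SK, SV> {
  void emit(SK key, SV value);        // map may emit zero or more pairs
}

interface MapFn<IK, IV, SK, SV> {
  // (item_key, value) -> zero or more (stack_key, value') pairs
  void map(IK key, IV value, Emitter<SK, SV> out);
}

interface ReduceFn<SK, SV, R> {
  // (stack_key, {set of value'}) -> result
  R reduce(SK key, Iterable<SV> values);
}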

Simple example: Word count
- Goal: Given a set of documents, count how often each word occurs
- Input: Key-value pairs (document:lineNumber, text)
- Output: Key-value pairs (word, #occurrences)
- What should be the intermediate key-value pairs? (Key design question!)

map(String key, String value) {
  // key: document name, line no
  // value: contents of line
  for each word w in value:
    emit(w, "1")
}

reduce(String key, Iterator values) {
  // key: a word
  // values: a list of counts
  int result = 0;
  for each v in values:
    result += ParseInt(v);
  emit(key, result)
}
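
The same logic in real Hadoop Java (a hedged sketch against the Hadoop 2.x+ API that appears later in this lecture; the class names are ours, and each public class would normally live in its own file):

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Mapper: (line offset, line text) -> (word, 1) for each word in the line
public class WordCountMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final IntWritable ONE = new IntWritable(1);
  private final Text word = new Text();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tok = new StringTokenizer(value.toString());
    while (tok.hasMoreTokens()) {
      word.set(tok.nextToken());
      context.write(word, ONE);                  // emit(w, 1)
    }
  }
}

// Reducer: (word, {1, 1, ...}) -> (word, total)
public class WordCountReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {
  @Override
  protected void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable v : values) sum += v.get(); // sum the per-word counts
    context.write(key, new IntWritable(sum));    // emit(key, result)
  }
}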

Simple example: Word count (trace)
Input: (1, the apple), (2, is an apple), (3, not an orange), (4, because the), (5, orange), (6, unlike the apple), (7, is orange), (8, not green)

1. Each mapper receives some of the KV-pairs as input:
   Mapper(1-2), Mapper(3-4), Mapper(5-6), Mapper(7-8)
2. The mappers process the KV-pairs one by one:
   Mapper(1-2) emits (the, 1), (apple, 1), (is, 1), (an, 1), (apple, 1)
   Mapper(3-4) emits (not, 1), (an, 1), (orange, 1), (because, 1), (the, 1)
   Mapper(5-6) emits (orange, 1), (unlike, 1), (the, 1), (apple, 1)
   Mapper(7-8) emits (is, 1), (orange, 1), (not, 1), (green, 1)
3. Each KV-pair output by a mapper is sent to the reducer that is responsible for it, according to the key range each node is responsible for:
   Reducer(A-G), Reducer(H-N), Reducer(O-U), Reducer(V-Z)
4. The reducers sort their input by key and group it:
   (an, {1, 1}), (apple, {1, 1, 1}), (because, {1}), (green, {1}), (is, {1, 1}), (not, {1, 1}), (orange, {1, 1, 1}), (the, {1, 1, 1}), (unlike, {1})
5. The reducers process their input one group at a time:
   (an, 2), (apple, 3), (because, 1), (green, 1), (is, 2), (not, 2), (orange, 3), (the, 3), (unlike, 1)

MapReduce dataflow
[Diagram: input data flows into the Mappers; the intermediate (key,value) pairs pass through "the Shuffle" to the Reducers, which write the output data]
- What makes this so scalable?
- In practice, mappers and reducers usually run on the same set of machines!

MapReduce system components
- To make this work, we need a few more parts…
- The file system (distributed across all nodes):
  - Stores the inputs, outputs, and temporary results
- The driver program (executes on one node):
  - Specifies where to find the inputs and the outputs
  - Specifies what mapper and reducer to use
  - Can customize the behavior of the execution
- The runtime system (controls nodes):
  - Supervises the execution of tasks
  - Esp. the JobTracker
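
A minimal driver sketch in Hadoop Java, tying together the hypothetical WordCountMapper/WordCountReducer from the earlier sketch (input and output paths come from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCountMapper.class);     // what mapper to use
    job.setReducerClass(WordCountReducer.class);   // what reducer to use
    job.setOutputKeyClass(Text.class);             // output pair types
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // where the inputs are
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // where the outputs go
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}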

The Underlying MapReduce data flow
[Diagram: data partitions by key feed the map computation partitions; a redistribution by the output's key ("shuffle") feeds the reduce computation partitions; a Coordinator oversees the stages. (Default MapReduce uses the filesystem between stages.)]

Observe That…
- All data is key/value pairs – a simple binary tuple
- Map can be thought of as a bolt that, upon each tuple, filters / restructures the tuple
- Reduce can be thought of as a bolt that buffers tuples with a common key
- The connection between map and reduce is done by something analogous to a "fieldGrouping"

What if a node crashes?
- How will we know?
  - The master pings every worker periodically
- What to do when a worker crashes?
  - Failed map task on node A: re-execute on another node B, and notify all the workers executing reduce tasks
    - If a reduce task has not read all the data from A yet, it will read from B
  - Failed reduce task: if not complete yet, re-execute on another node
  - Intermediate outputs from map tasks are stored locally on the mapper, whereas outputs from reduce tasks are in the distributed file system
- What to do when the master crashes?
  - Could periodically checkpoint state & restart from there
  - Or just abort the computation - is this a good idea?
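
A rough sketch of the detection half (our own simplification, not the Google implementation): the master timestamps each heartbeat and re-queues the tasks of any worker that stays silent too long.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

class HeartbeatMonitor {
  private static final long TIMEOUT_MS = 10_000;   // assumed threshold
  private final Map<String, Long> lastSeen = new ConcurrentHashMap<>();

  // Called whenever a worker answers a ping or sends a status update
  void recordHeartbeat(String workerId) {
    lastSeen.put(workerId, System.currentTimeMillis());
  }

  // Called periodically by the master's scheduler thread
  void checkForFailures(TaskScheduler scheduler) {
    long now = System.currentTimeMillis();
    lastSeen.forEach((worker, seen) -> {
      if (now - seen > TIMEOUT_MS) {
        scheduler.rescheduleTasksOf(worker);  // re-execute its tasks elsewhere
        lastSeen.remove(worker);
      }
    });
  }
}

interface TaskScheduler {                     // hypothetical interface
  void rescheduleTasksOf(String workerId);
}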

Other challenges
- Locality
  - Try to schedule a map task on a machine that already has the data
- Task granularity
  - How many map tasks? How many reduce tasks?
- Dealing with stragglers
  - Schedule some backup tasks
- Saving bandwidth
  - E.g., with combiners
- Handling bad records
  - A crashing worker sends a "last gasp" packet with the current sequence number, so the offending record can be skipped on re-execution
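
On the combiner point: in Hadoop, a combiner pre-aggregates each mapper's local output before the shuffle. A one-line sketch in the driver, reusing the hypothetical WordCountReducer from earlier (safe here because summing counts is associative and commutative; not every reducer can double as a combiner):

// Run a reduce-style pre-aggregation on the map side to save bandwidth
job.setCombinerClass(WordCountReducer.class);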

MapReduce as Stream Processing with End-of-Stream
[Diagram: two WorkerServers, each running a File Spout (reading file.0…, file.1…) that feeds a MapBolt; a StreamRouter (FieldBased or RoundRobin) routes tuples into the MapBolts, a StreamRouter (FieldBased) routes map output to the ReduceBolts, and a StreamRouter (First) sends reduce output to a Printer Bolt. The spouts emit "eos" markers to signal end-of-stream.]
- shuffleGrouping in StormLite uses hashing to group
- The Master (Spark webapp) POSTs the WorkerJob in JSON to the WorkerServers and gets updates from a WorkerServer background thread

Summary: MapReduce
Three major stages:
- map items individually, outputting 0 or more records with keys
- shuffle records by keys
- reduce the entries for each key
Naturally distributes + parallelizes; can create multi-stage pipelines, loops, etc. to implement richer algorithms

Plan for today
- Google File System
- Introduction to MapReduce
  - Programming model
  - Data flow
  - Example tasks
- Hadoop and HDFS
  - Architecture
  - Using Hadoop
  - Using HDFS
  - Beyond MapReduce
NEXT

Programming for MapReduce
- Programming for MapReduce is very much like a callback-based programming model
  - map() gets called for each input record
  - reduce() gets called for each group
- Internally, the outputs of map() get sorted by key
- Important: don't make assumptions about what is shared across calls to map() or reduce()! (See the anti-pattern sketch below.)
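
A hypothetical anti-pattern in Hadoop Java illustrating the warning: the field below only counts the calls seen by one task on one node, and tasks may be re-executed after failures, so it is meaningless as a global count.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BrokenMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static int seen = 0;   // WRONG: not shared across nodes or retries

  @Override
  protected void map(LongWritable key, Text value, Context context) {
    seen++;   // per-call state is fine; cross-call/cross-task state is not
  }
}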

Beyond word count
- Distributed grep – all lines matching a pattern

Exercise: fill in the blanks
Input: (k,v) where k is __________ and v is __________
map(key : __________, value : __________) {
}
reduce(key : __________, values: __________) {
}
Output: (k,v) where k is __________ and v is __________

Beyond word count
- Distributed grep – all lines matching a pattern
  - Map: filter by pattern
  - Reduce: output set
- Count URL access frequency
  - Map: output each URL as key, with count 1
  - Reduce: sum the counts
- Reverse web-link graph
  - Map: output (target, source) pairs when a link to target is found in source
  - Reduce: concatenate values and emit (target, list(source))
- Inverted index
  - Map: emits (word, documentID)
  - Reduce: combines these into (word, list(documentID))
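
One of these worked out as a hedged Hadoop Java sketch, the inverted index (assuming an input format that presents (documentID, contents) pairs, e.g., KeyValueTextInputFormat; the class names are ours):

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import java.util.StringTokenizer;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (word, documentID) for every word in the document
public class InvertedIndexMapper extends Mapper<Text, Text, Text, Text> {
  @Override
  protected void map(Text docId, Text contents, Context context)
      throws IOException, InterruptedException {
    StringTokenizer tok = new StringTokenizer(contents.toString());
    while (tok.hasMoreTokens()) {
      context.write(new Text(tok.nextToken()), docId);
    }
  }
}

// Reduce: combine into (word, list(documentID)), deduplicating IDs
public class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text word, Iterable<Text> docIds, Context context)
      throws IOException, InterruptedException {
    Set<String> unique = new HashSet<>();
    for (Text id : docIds) unique.add(id.toString());  // Hadoop reuses Text objects; copy to String
    context.write(word, new Text(String.join(",", unique)));
  }
}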