Experiences Teaching MapReduce in the Clouds Ari Rabkin, Charles Reiss, Randy Katz, David Patterson University of California, Berkeley 1


Post on 23-Dec-2015



Experiences Teaching MapReduce in the Clouds

Ari Rabkin, Charles Reiss, Randy Katz, David Patterson

University of California, Berkeley


Introduction: What we did

• Hadoop MapReduce performance benchmarking

• 300 students, 80 cores per student (in one semester)

• More than 2400 cores in use at peak

• Impossible without the cloud


Context: Teaching varieties of parallelism

• Instruction (e.g. pipelining), Data (e.g. vector instructions), Request (e.g. replicated webservers), …

• We were teaching many of these in a sophomore course

• This talk focuses on task parallelism


Task parallelism

• Our example: MapReduce

• Sophomores wrote a MapReduce program and ran it in a distributed environment

• Observed speedup

• On a large dataset using real-world tools


Others have taught MapReduce

• As a programming paradigm [Johnson '08]

• As part of an elective "big data" analysis course [Aaron '08, Lin '10, Couch '10]


Unlike prior work, we

• Cared about performance and its implementation on a cluster

• Taught sophomores

• Emphasized cost and economics


Outline

• Motivation: MapReduce and why it matters
• Assignment goals and design
• Experiences
  o challenges for students
  o challenges for instructors


MapReduce: Why it matters

• Trend of "big data"
  o more data collection: smartphones, Internet services, etc.
  o cheaper data storage
  o cheaper access to data processing capability: public cloud computing providers
• Dominant way to make sense of very large datasets on commodity hardware is MapReduce
  o Google, Facebook, IBM, Amazon, many more, …


MapReduce: Programming model

input: input records (e.g. page from a web crawl)

"map": a function call per record

key-value pairs (e.g. word -> # of times in record)

group by: list of values for each key

"reduce": a function call per group

output: results for each key (e.g. word and its number of occurrences)
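The word-count flow on this slide can be imitated in a single process. The sketch below is plain Java, not the Hadoop API; the class and method names (`WordCountSketch`, `map`, `reduce`, `run`) are hypothetical and chosen only to mirror the slide's labels.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Single-process imitation of map -> group by -> reduce (hypothetical names, not Hadoop). */
final class WordCountSketch {

    /** "map": one call per input record; emits word -> number of times in that record. */
    static Map<String, Integer> map(String record) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : record.toLowerCase().split("\\W+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** "reduce": one call per key, given every value grouped under that key. */
    static int reduce(String key, List<Integer> values) {
        int total = 0; // primitive accumulator
        for (int value : values) {
            total += value;
        }
        return total;
    }

    /** Runs map on every record, groups the emitted values by key, then reduces each group. */
    static Map<String, Integer> run(List<String> records) {
        Map<String, List<Integer>> groups = new TreeMap<>();
        for (String record : records) {
            for (Map.Entry<String, Integer> pair : map(record).entrySet()) {
                groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
            }
        }
        Map<String, Integer> output = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : groups.entrySet()) {
            output.put(group.getKey(), reduce(group.getKey(), group.getValue()));
        }
        return output;
    }
}
```

Calling `WordCountSketch.run` on a list of strings returns each distinct word with its total count across all records.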


MapReduce: Distributed execution

Input File Partition (×3) -> Map task (×3) -> Reduce task (×2) -> Output File (×2)

Multiple "map", "reduce" calls per task
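One way to picture how pairs emitted by several map tasks reach the reduce tasks: every key is assigned to exactly one reduce task, so all of its values meet in the same place. The sketch below is plain Java with hypothetical names (`ShuffleSketch`, `reduceTaskFor`, `shuffle`). The hash-modulo rule is modeled on Hadoop's default hash partitioning and should be read as an assumption, not as a description of the course setup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Sketch of routing map-task output to reduce tasks (hypothetical names, not Hadoop). */
final class ShuffleSketch {

    /** Picks a reduce task for a key; the mask keeps the hash non-negative. */
    static int reduceTaskFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    /** Collects the pairs from every map task into one grouped input per reduce task. */
    static List<Map<String, List<Integer>>> shuffle(
            List<Map<String, Integer>> mapTaskOutputs, int numReduceTasks) {
        List<Map<String, List<Integer>>> reduceInputs = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            reduceInputs.add(new TreeMap<>());
        }
        for (Map<String, Integer> output : mapTaskOutputs) {
            for (Map.Entry<String, Integer> pair : output.entrySet()) {
                int task = reduceTaskFor(pair.getKey(), numReduceTasks);
                reduceInputs.get(task)
                        .computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                        .add(pair.getValue());
            }
        }
        return reduceInputs;
    }
}
```

Because the assignment of keys to reduce tasks depends only on the key, each reduce task can run independently of the others.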


Assignment goals

• Measure performance
  o Observe parallel speedup
• Non-trivial use of MapReduce
  o Multiple stages: output of one MapReduce program used as input to another
• Off-the-shelf tools
  o Hadoop (standard industry platform, open source)


Why we used cloud computing

• Datacenter-like resources to hundreds of students
  o Performance isolation
  o Complement teaching about datacenter architecture
• Maximum actual usage of >2400 cores
  o Larger than our instructional clusters
  o Interference with other instructional users


Usage over time

[Plot: usage over time, annotated with "Lab" and "Project deadline"]


Assignment (Spring)

• Two-stage: co-occurrence (“How associated is a target word with other words?”) + sorting (top-K)

• Java: native Hadoop API language

• Dataset of Usenet posts: 8.4GB (compressed size)

inst.eecs.berkeley.edu/~cs61c/sp11/
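To make the two-stage shape concrete, here is a single-process sketch in plain Java. The slides do not give the assignment's actual association measure, so stage one simply counts how many posts containing the target word also contain each other word; that scoring rule, and the names `CoOccurrenceSketch`, `stageOne` and `stageTwo`, are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Two-stage sketch: co-occurrence counts, then top-K (hypothetical names and scoring). */
final class CoOccurrenceSketch {

    /** Stage 1: for each word, count the posts in which it appears together with the target. */
    static Map<String, Integer> stageOne(List<String> posts, String target) {
        Map<String, Integer> counts = new HashMap<>();
        for (String post : posts) {
            Set<String> words = new HashSet<>(Arrays.asList(post.toLowerCase().split("\\W+")));
            words.remove("");
            if (!words.contains(target)) {
                continue; // this post says nothing about the target word
            }
            for (String word : words) {
                if (!word.equals(target)) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
        }
        return counts;
    }

    /** Stage 2: takes stage 1's output as input and keeps the K highest-count words. */
    static List<String> stageTwo(Map<String, Integer> counts, int k) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue()); // descending count
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey()); // ties: alphabetical
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

The output of `stageOne` feeds directly into `stageTwo`, which is the "output of one MapReduce program used as input to another" structure in miniature.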


Assignment structure (Spring)

1. Laboratory 1: MapReduce programming
  o Against native Hadoop API
  o Running on lab machines only (not parallel)
  o Trivial MR tasks (fit in lab time)

2. Laboratory 2: Measuring MR at scale
  o Timing, calculations for existing MR programs
  o Some design exercises; no new coding

3. Project Part 1: implement, run locally (smaller datasets)

4. Project Part 2: time, get working at scale


What students achieved

[Plot; reference line = linear speedup]
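For reference, "linear speedup" can be stated as follows. The notation ($T(n)$ for running time on $n$ cores, $n_0$ for the baseline core count) is an assumption introduced here and does not appear on the slides.

```latex
% Speedup on n cores relative to a baseline run on n_0 cores
S(n) = \frac{T(n_0)}{T(n)}

% Linear speedup: running time falls in proportion to the cores added
S(n) = \frac{n}{n_0}
\quad\Longleftrightarrow\quad
T(n) = \frac{n_0}{n}\, T(n_0)

% Parallel efficiency: the fraction of linear speedup actually achieved
E(n) = \frac{S(n)}{n / n_0}
```

Under these definitions, doubling the cores halves the running time when speedup is linear, and $E(n) = 1$.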


Debugging difficulties

• First time efficiency mattered for many students

• Long runtime + remote execution -> longer debugging cycle
  o Real-world problem


Efficiency

Most students on par with reference solution

~10 minutes — time on input big enough for MapReduce to make sense

Hadoop not well-tuned for small inputs

on 40 cores


Efficiency

But some students observed very bad performance

Waiting 40+ minutes for results which should take 10 minutes

on 40 cores


Things we learned about our students' Java

Integer numSeen;
for (...) {
    ...
    numSeen += 1;
}

for (each word in bigString) {
    ...
    if (bigString.contains(targetWord)) {
        ...
    }
}

// and more...
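The slide does not spell out what is wrong with these two fragments. One plausible reading is that the boxed `Integer` counter is unboxed and re-boxed on every increment, and that `bigString.contains(targetWord)` rescans the whole string on every loop iteration. The sketch below shows each pattern next to a cheaper equivalent; the class and method names are hypothetical, and the whitespace split is an assumption about how words were extracted.

```java
/** Side-by-side sketch of the two slide patterns and cheaper equivalents (hypothetical names). */
final class StudentJavaFixes {

    /** Slide pattern: a boxed counter; each += unboxes, adds, and boxes a new value. */
    static int countBoxed(String[] words) {
        Integer numSeen = 0;
        for (String word : words) {
            numSeen += 1;
        }
        return numSeen;
    }

    /** Same result with a primitive counter, so no boxing happens inside the loop. */
    static int countPrimitive(String[] words) {
        int numSeen = 0;
        for (String word : words) {
            numSeen += 1;
        }
        return numSeen;
    }

    /** Slide pattern: the whole string is rescanned once per word (quadratic work). */
    static int hitsRescanning(String bigString, String targetWord) {
        int hits = 0;
        for (String word : bigString.split("\\s+")) {
            if (bigString.contains(targetWord)) {
                hits++;
            }
        }
        return hits;
    }

    /** Same result with the loop-invariant check hoisted out and evaluated once. */
    static int hitsHoisted(String bigString, String targetWord) {
        boolean hasTarget = bigString.contains(targetWord);
        int hits = 0;
        for (String word : bigString.split("\\s+")) {
            if (hasTarget) {
                hits++;
            }
        }
        return hits;
    }
}
```

Each pair returns the same values; only the amount of work per iteration differs, which matters once inputs are large enough for runtime to be noticeable.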


Using a public cloud provider

• Grant from Amazon ($100 credit/student)

• We wanted:
  o More capacity than we could provision internally
  o Students to use the cloud provider like a commercial user


Using a public cloud provider

"Backup" billing even with grant


What it cost (in grant credits)

Outliers: usually misunderstood tools; tried restarting repeatedly after problems

Most student costs reasonable

Each used a "dedicated" cluster of around 80 cores.


Student satisfaction

• When surveyed, students ranked this project first among the three software projects
  o Most students (90% of responders) recommended keeping the project in later semesters

• Students reported that this project impressed potential employers


Conclusion/Lessons Learned

• Students wrote a parallel program and ran it against a large data set
  o Almost all students ran programs on large datasets and observed parallel speedups
  o Early experience for sophomores debugging, deploying programs with large datasets
• First time that students wrote programs with long enough run-time to measure efficiency
• Public clouds allowed us to demonstrate scale with low per-student costs


Other CC uses: long-running servers

• Long-running servers per student or group
• Web/service classes

No elasticity, low resource usage: cost-effective?


Other CC uses: VM per student

• Consistent infrastructure for development
• Way to hand out/in assignments
• With or without a “cloud” to host the VMs


Other CC uses: static clusters

• Customized machines for a particular course
• Sometimes done without cost benefit: cluster kept up for entire semester


Backup Slides


Scripts

• https://github.com/woggling/ec2-wrappers

• Danger! Pre-alpha software!
  – Depends on Berkeley infrastructure in several places
  – Could spend real money; do not use without understanding
  – Requires some manual monitoring
  – Documentation is probably incomplete


Using a public cloud provider

[Chart: 56% vs. 44%; category labels missing from the transcript]