Fair Scheduling for MapReduce Clusters

Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Khaled Elmeleegy+, Scott Shenker, Ion Stoica
UC Berkeley, *Facebook Inc, +Yahoo! Research


TRANSCRIPT

Page 1

Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*,Khaled Elmeleegy+, Scott Shenker, Ion Stoica

UC Berkeley, *Facebook Inc, +Yahoo! Research

Fair Scheduling for MapReduce Clusters

UC Berkeley

Page 2

Motivation

• MapReduce / Hadoop originally designed for high throughput batch processing

• Today’s workload is far more diverse:
  – Many users want to share a cluster
    • Engineering, marketing, business intelligence, etc.
  – Vast majority of jobs are short
    • Ad-hoc queries, sampling, periodic reports
  – Response time is critical
    • Interactive queries, deadline-driven reports

How can we efficiently share MapReduce clusters between users?

Page 3

Example: Hadoop at Facebook

• 2000-node, 20 PB data warehouse, growing at 40 TB/day
• Applications: data mining, spam detection, ads
• 200 users (half non-engineers)
• 7500 MapReduce jobs / day

[Chart: CDF of job running time in seconds, log scale from 1 to 100,000. Median: 84s]

Page 4

Approaches to Sharing

• Hadoop default scheduler (FIFO)
  – Problem: short jobs get stuck behind long ones

• Separate clusters
  – Problem 1: poor utilization
  – Problem 2: costly data replication
    • Full replication nearly infeasible at FB / Yahoo! scale
    • Partial replication prevents cross-dataset queries

Page 5

Our Work: Hadoop Fair Scheduler

• Fine-grained sharing at the level of map & reduce tasks
• Predictable response times and user isolation
• Example with 3 job pools:
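To make the pool example concrete, here is a minimal sketch of how slot-level fair shares could be computed across weighted pools. It is illustrative only: the pool names, weights, and demands are assumptions, and it is not the actual Hadoop Fair Scheduler code.

```python
# Illustrative weighted max-min fair sharing across job pools. This is a
# sketch, not the actual Hadoop Fair Scheduler implementation; the pool
# names, weights, and demands below are made up for the example.

def fair_shares(total_slots, pools):
    """pools: dict name -> {"demand": tasks wanted, "weight": pool weight}.
    Distributes slots in proportion to weight, never giving a pool more than
    it demands, and redistributing leftover capacity to the remaining pools."""
    shares = {name: 0.0 for name in pools}
    active = set(pools)
    remaining = float(total_slots)
    while active and remaining > 1e-9:
        total_weight = sum(pools[p]["weight"] for p in active)
        satisfied = {p for p in active
                     if shares[p] + remaining * pools[p]["weight"] / total_weight
                        >= pools[p]["demand"]}
        if not satisfied:
            for p in active:
                shares[p] += remaining * pools[p]["weight"] / total_weight
            remaining = 0.0
        else:
            for p in satisfied:
                remaining -= pools[p]["demand"] - shares[p]
                shares[p] = float(pools[p]["demand"])
            active -= satisfied
    return shares

# Example: 100 map slots split between three pools.
print(fair_shares(100, {
    "production": {"demand": 80, "weight": 2.0},
    "ad-hoc":     {"demand": 40, "weight": 1.0},
    "reports":    {"demand": 10, "weight": 1.0},
}))
# -> {'production': 60.0, 'ad-hoc': 30.0, 'reports': 10.0}
```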

Page 6

What This Talk is About

• Two challenges for fair sharing in MapReduce
  – Data locality (conflicts with fairness)
  – Task interdependence (reduces can’t finish until maps are done, leading to slot hoarding)

• Some solutions

• Current work: Mesos project

Page 7

Outline

• Challenge 1: data locality
  – Solution: delay scheduling

• Challenge 2: reduce slot hoarding
  – Solution: copy-compute splitting

• Current work: Mesos

Page 8

The Problem

[Diagram: a master assigning tasks from Job 1 and Job 2 (in scheduling order) to six slaves that hold blocks of File 1 and File 2.]

Page 9

The Problem

[Diagram, continued: after a fair-sharing decision changes the scheduling order, the newly scheduled job's tasks land on slaves that do not hold their input blocks.]

Problem: Fair decision hurts locality
Especially bad for jobs with small input files

Page 10

Data Locality in Production at Facebook

[Chart: Locality vs. Job Size at Facebook — percent of local map tasks (node locality and rack locality) vs. job size in number of input blocks, log scale; the smallest jobs, which account for 58% of all jobs, have the lowest locality.]

Page 11

Special Instance: Sticky Slots

• Under fair sharing, locality can be poor even when all jobs have large input files

• Problem: jobs get “stuck” in the same set of task slots
  – When a task in job j finishes, the slot it was running in is given back to j, because j is below its share
  – Bad because data files are spread out across all nodes

  Job     Fair Share   Running Tasks
  Job 1   2            2
  Job 2   2            2

  (after a Job 1 task finishes)

  Job     Fair Share   Running Tasks
  Job 1   2            1
  Job 2   2            2

[Diagram: a master and four slaves, each running one task; when a task finishes, the freed slot goes back to the same job.]

Page 12

Special Instance: Sticky Slots

• Under fair sharing, locality can be poor even when all jobs have large input files

• Problem: jobs get “stuck” in the same set of task slots
  – When a task in job j finishes, the slot it was running in is given back to j, because j is below its share
  – Bad because data files are spread out across all nodes

[Chart: Data Locality vs. Number of Concurrent Jobs — percent of local map tasks for 5, 10, 20, and 50 concurrent jobs.]

Page 13

Solution: Delay Scheduling

• Relax queuing policy to make jobs wait for a limited time if they cannot launch local tasks

• Result: Very short wait time (1-5s) is enough to get nearly 100% locality

Page 14

Delay Scheduling Example

Idea: Wait a short time to get data-local scheduling opportunities

[Diagram: the same master and slaves; when the job at the head of the queue has no local task for a freed slot, it waits (“Wait!”) and the slot is given to a task that is local on that node.]

Page 15

Delay Scheduling Details

• Scan jobs in the order given by the queuing policy, picking the first that is permitted to launch a task

• Jobs must wait before being permitted to launch non-local tasks
  – If wait < T1, only allow node-local tasks
  – If T1 < wait < T2, also allow rack-local
  – If wait > T2, also allow off-rack

• Increase a job’s time waited when it is skipped
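The rules above fit in a short loop. Here is a sketch in Python pseudocode; it is not the Hadoop implementation, and the helper best_locality(), the launch() call, the job fields, and heartbeat_interval are placeholders for whatever bookkeeping the real scheduler does.

```python
# Sketch of delay scheduling as described on this slide (illustrative only;
# best_locality(), launch(), heartbeat_interval and the job fields are
# placeholders, not the real Hadoop Fair Scheduler API).

NODE_LOCAL, RACK_LOCAL, OFF_RACK = 0, 1, 2   # most to least preferred

def allowed_level(job, t1, t2):
    """Locality level a job may launch at, based on how long it has waited."""
    if job.wait_time < t1:
        return NODE_LOCAL
    if job.wait_time < t2:
        return RACK_LOCAL
    return OFF_RACK

def assign_slot(free_node, sorted_jobs, t1, t2, heartbeat_interval):
    """Called when a slot frees up on free_node. sorted_jobs is the job list
    in the order given by the queuing policy (e.g. fair sharing)."""
    for job in sorted_jobs:
        level = best_locality(job, free_node)          # NODE_/RACK_/OFF_RACK
        if level <= allowed_level(job, t1, t2):
            job.wait_time = 0.0                        # launched: reset wait
            return launch(job, free_node, level)
        job.wait_time += heartbeat_interval            # skipped: waited longer
    return None                                        # no job launched a task
```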

Page 16

Analysis

• When is it worth it to wait, and for how long?
  – Waiting is worth it if E(wait time) < cost to run non-locally

• Simple model: data-local scheduling opportunities arrive following a Poisson process

• Expected response time gain from delay scheduling:

  E(gain) = (1 - e^(-w/t)) (d - t)

  where w = wait amount, t = expected time to get a local scheduling opportunity, and d = delay from running non-locally

• Optimal wait time is infinity
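Spelling out how the formula follows from the stated Poisson assumption (a one-line reading, not the paper's full analysis):

```latex
% Local scheduling opportunities for the job arrive with mean inter-arrival
% time t, so at least one appears within a wait window of length w with
% probability
\Pr[\text{local slot within } w] \;=\; 1 - e^{-w/t}.
% When that happens, the job pays on the order of t in extra waiting instead
% of the non-local delay d, a saving of d - t, giving
E[\text{gain}] \;=\; \bigl(1 - e^{-w/t}\bigr)\,(d - t).
% Since d > t in practice, this is increasing in w, which is the sense in
% which the optimal wait time in this simple model is infinity; the practical
% point is that even a small w already captures most of the gain.
```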

Page 17

Analysis Continued

E(gain) = (1 - e^(-w/t)) (d - t)

  where w = wait amount, t = expected time to get a local scheduling opportunity, and d = delay from running non-locally

• Typical values of t and d:
  t = (avg task length) / (file replication × tasks per node)
  d ≥ (avg task length)

• With file replication = 3 and 6 task slots per node (the values used on the next slide), t ≈ (avg task length) / 18
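A back-of-the-envelope check of these values. The 60-second average task length is an assumed number for illustration; the replication factor and slots per node match the R = 3 and L = 6 used on the next slide.

```python
# Back-of-the-envelope check of how long a job needs to wait. The 60 s
# average task length is an assumption; replication 3 and 6 slots per node
# match the R and L used on the next slide.
import math

avg_task_len = 60.0                                  # seconds (assumed)
replication, slots_per_node = 3, 6

t = avg_task_len / (replication * slots_per_node)    # ~3.3 s to a local slot
d = avg_task_len                                     # lower bound on non-local delay

for w in (1, 5, 10):
    gain = (1 - math.exp(-w / t)) * (d - t)
    print(f"wait {w:2d} s -> expected gain ~ {gain:4.1f} s")
# Waiting just a few seconds already captures most of the achievable gain,
# consistent with the earlier claim that 1-5 s of waiting is enough.
```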

Page 18

More Analysis

• What if some nodes are occupied by long tasks?

• Suppose we have:
  – A fraction ρ of active tasks are long
  – R replicas of each block
  – L task slots per node

• For a given data block, P(all replicas blocked by long tasks) = ρ^(RL)

• With R = 3 and L = 6, this is less than 2% if ρ < 0.8
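A quick numerical check of the claim above:

```python
# Numerical check of the bound on this slide: with R = 3 replicas and
# L = 6 slots per node, a block has all replicas blocked with probability
# rho ** (R * L), where rho is the fraction of active tasks that are long.
R, L = 3, 6
for rho in (0.5, 0.8, 0.9):
    print(f"rho = {rho}: P(all replicas blocked) = {rho ** (R * L):.4f}")
# rho = 0.8 gives about 0.018, i.e. under 2%, as stated above.
```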

Page 19

Evaluation

• 100-node EC2 cluster, 4 cores/node

• Job submission schedule based on job sizes and inter-arrival times at Facebook
  – 100 jobs grouped into 9 “bins” of sizes

• Three workloads:
  – IO-heavy, CPU-heavy, mixed

• Three schedulers:
  – FIFO
  – Fair sharing
  – Fair + delay scheduling (wait time = 5s)

Page 20

Results for IO-Heavy Workload

[Charts: job response times for small jobs (1-10 input blocks), medium jobs (50-800 input blocks), and large jobs (4800 input blocks); annotated gains of up to 5x and 40%.]

Page 21

Results for IO-Heavy Workload

[Charts: percent of local maps per bin (1-9) for FIFO, Fair, and Fair + Delay Scheduling (y-axis 0-100%); and speedup from delay scheduling per bin (y-axis 0-50%).]

Page 22

Sticky Slots Microbenchmark

• 5-50 big concurrent jobs on a 100-node EC2 cluster

• 5s delay scheduling

[Charts: percent of local maps and benchmark running time (minutes) for 5, 10, 20, and 50 jobs, with and without delay scheduling; delay scheduling gives up to a 2x improvement in running time.]

Page 23

Summary

• Delay scheduling works well under two conditions:
  – Sufficient fraction of tasks are short relative to jobs
  – There are many locations where a task can run efficiently
    • Blocks replicated across nodes, multiple tasks/node

• Generalizes beyond fair sharing & MapReduce
  – Applies to any queuing policy that gives an ordered job list
    • Used in the Hadoop Fair Scheduler to implement a hierarchical policy
  – Applies to placement needs other than data locality

Page 24

Outline

• Challenge 1: data locality
  – Solution: delay scheduling

• Challenge 2: reduce slot hoarding
  – Solution: copy-compute splitting

• Current work: Mesos

Page 25

The Problem

[Diagram: map slots and reduce slots over time, shared by Job 1 and Job 2.]

Page 26

The Problem

[Diagram, continued: Job 2 maps done.]

Page 27

The Problem

[Diagram, continued: Job 2 reduces done.]

Page 28

The Problem

[Diagram, continued: Job 3 submitted.]

Page 29

The Problem

[Diagram, continued: Job 3 maps done.]

Page 30

The Problem

[Diagram, continued: map and reduce slot timelines for Jobs 1-3.]

Page 31

The Problem

Problem: Job 3 can’t launch reduces until Job 1 finishes

[Diagram, continued.]

Page 32

Solutions to Reduce Slot Hoarding

• Task killing (preemption):
  – If a job’s fair share is not met after a timeout, kill tasks from jobs over their share
  – Pros: simple, predictable
  – Cons: wasted work
  – Available in Hadoop 0.21 and CDH 3
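A rough sketch of how such a preemption pass could look. It is illustrative only: the field names (fair_share, running_tasks, below_share_since, tasks, start_time) are assumptions, not the Hadoop Fair Scheduler's actual data structures or configuration.

```python
# Illustrative preemption pass (not the real Hadoop Fair Scheduler code; all
# field names here are assumptions). Kill just enough of the most recently
# launched tasks from over-share jobs to let starved jobs reach their share.

def tasks_to_preempt(jobs, now, timeout):
    # Slots missing from jobs that have been below their share for > timeout.
    deficit = sum(job.fair_share - job.running_tasks
                  for job in jobs
                  if job.running_tasks < job.fair_share
                  and now - job.below_share_since > timeout)

    victims = []
    donors = sorted((j for j in jobs if j.running_tasks > j.fair_share),
                    key=lambda j: j.running_tasks - j.fair_share,
                    reverse=True)
    for job in donors:
        surplus = job.running_tasks - job.fair_share
        # Newest tasks first, to minimize wasted work when they are killed.
        for task in sorted(job.tasks, key=lambda t: t.start_time, reverse=True):
            if deficit <= 0 or surplus <= 0:
                break
            victims.append(task)
            deficit -= 1
            surplus -= 1
        if deficit <= 0:
            break
    return victims
```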

Page 33

Solutions to Reduce Slot Hoarding

• Copy-compute splitting:
  – Divide reduce tasks into two types of tasks: “copy tasks” and “compute tasks”
  – Separate admission control for these two
    • Larger # of copy slots / node, with a per-job cap
    • Smaller # of compute slots that tasks “graduate” into
  – Pros: no wasted work, overlaps IO & CPU more
  – Cons: extra memory, may cause IO contention
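Since this mechanism was a proposal rather than shipped code, the following is only a hedged sketch of what the two-level admission control might look like; the class, slot counts, and method names are all assumptions.

```python
# Hypothetical sketch of copy-compute admission control on one node (this
# mechanism was never shipped, so everything here is an assumption).

class CopyComputeSlots:
    def __init__(self, copy_slots, compute_slots, per_job_copy_cap):
        self.copy_free = copy_slots          # larger pool: shuffle is IO-bound
        self.compute_free = compute_slots    # smaller pool: CPU-bound work
        self.per_job_copy_cap = per_job_copy_cap
        self.copies_per_job = {}             # job_id -> running copy tasks

    def admit_copy(self, job_id):
        """Admit a reduce into its copy (shuffle) phase, within the per-job cap."""
        used = self.copies_per_job.get(job_id, 0)
        if self.copy_free > 0 and used < self.per_job_copy_cap:
            self.copy_free -= 1
            self.copies_per_job[job_id] = used + 1
            return True
        return False

    def graduate_to_compute(self, job_id):
        """Move a reduce whose copy phase finished into a compute slot,
        releasing its copy slot for another job's shuffle."""
        if self.compute_free > 0 and self.copies_per_job.get(job_id, 0) > 0:
            self.compute_free -= 1
            self.copy_free += 1
            self.copies_per_job[job_id] -= 1
            return True
        return False
```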

Page 34

Copy-Comp. Splitting in Single Job

[Diagram: a single job’s map and reduce slot timelines with conventional reduce tasks; marker: maps done.]

Page 35

Copy-Comp. Splitting in Single Job

[Diagram, continued: the reduce timeline split into a copy portion (only IO) and a compute portion (only computation); marker: maps done.]

Page 36

Copy-Comp. Splitting in Single Job

[Diagram, continued.]

Page 37

Copy-Comp. Splitting Evaluation

• Initial experiments showed a 10% gain for a single job and 1.5-2x response time gains over naïve fair sharing in a multi-job scenario

• Copy-compute splitting not yet implemented anywhere (killing was good enough for now)

Page 38

Outline

• Challenge 1: data locality
  – Solution: delay scheduling

• Challenge 2: reduce slot hoarding
  – Solution: copy-compute splitting

• Current work: Mesos

Page 39

Challenges in Hadoop Cluster Management

• Isolation (both fault and resource isolation)
  – E.g. if co-locating production and ad-hoc jobs

• Testing and deploying upgrades

• JobTracker scalability (in larger clusters)

• Running non-Hadoop jobs
  – MPI, streaming MapReduce, etc.

Page 40

Mesos

Mesos is a cluster resource manager over which multiple instances of Hadoop and other parallel applications can run

[Diagram: Hadoop (production), Hadoop (ad-hoc), and MPI running side by side on top of Mesos, which manages the cluster’s nodes.]

Page 41

What’s Different About It?

Other resource managers exist today, including:
  – Batch schedulers (e.g. Torque, Sun Grid Engine)
  – VM schedulers (e.g. Eucalyptus)

However, these have two problems with Hadoop-like workloads:
  – Data locality is hurt due to static partitioning
  – Utilization is compromised because jobs hold nodes for their full duration

Mesos addresses these through two features:
  – Fine-grained sharing at the level of tasks
  – Two-level scheduling where jobs control placement
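A toy illustration of the two-level idea: the resource manager only offers free slots, and each framework decides which offers to accept, so placement (e.g. data locality) stays under the job's control. This mirrors the concept on the slide, not the actual Mesos API or its message formats.

```python
# Toy two-level scheduling with resource offers (conceptual only; this is
# not the Mesos API, just an illustration of "jobs control placement").

def offer_round(free_slots_by_node, frameworks):
    """The resource manager offers free slots node by node; each framework
    takes what it wants and the rest is offered to the next framework."""
    for node, slots in free_slots_by_node.items():
        for fw in frameworks:                  # e.g. ordered by fair share
            accepted = fw.handle_offer(node, slots)
            slots -= accepted
            if slots == 0:
                break

class HadoopLikeFramework:
    def __init__(self, pending_tasks):
        # pending_tasks: task_id -> set of nodes holding that task's input
        self.pending = dict(pending_tasks)

    def handle_offer(self, node, slots):
        """Accept offers only for tasks whose input data lives on `node`
        (a framework could also layer delay scheduling on top of this)."""
        launched = 0
        for task_id, preferred in list(self.pending.items()):
            if launched == slots:
                break
            if node in preferred:
                del self.pending[task_id]
                launched += 1
        return launched
```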

Page 42

Mesos Architecture

[Diagram: a Mesos master coordinating several Mesos slaves; framework schedulers (Hadoop 0.20, Hadoop 0.19, MPI) register with the master, and their jobs run as tasks inside per-framework executors (Hadoop 0.20 executor, Hadoop 0.19 executor, MPI executor) on the slaves.]

Page 43

Mesos Status

• Implemented in 10,000 lines of C++
  – Efficient event-driven messaging lets the master scale to 10,000 scheduling decisions / second
  – Fault tolerance using ZooKeeper
  – Supports Hadoop, MPI & Torque

• Test deployments at Twitter and Facebook

• Open source, at www.github.com/mesos

Page 44

Thanks!

• Any questions?

[email protected]