Fair Scheduling for MapReduce Clusters

Matei Zaharia, Dhruba Borthakur*, Joydeep Sen Sarma*, Khaled Elmeleegy+, Scott Shenker, Ion Stoica

UC Berkeley, *Facebook Inc, +Yahoo! Research
Motivation

• MapReduce / Hadoop originally designed for high-throughput batch processing
• Today’s workload is far more diverse:
– Many users want to share a cluster
• Engineering, marketing, business intelligence, etc.
– Vast majority of jobs are short
• Ad-hoc queries, sampling, periodic reports
– Response time is critical
• Interactive queries, deadline-driven reports
How can we efficiently share MapReduce clusters between users?
Example: Hadoop at Facebook

• 2000-node, 20 PB data warehouse, growing at 40 TB/day
• Applications: data mining, spam detection, ads
• 200 users (half non-engineers)
• 7500 MapReduce jobs / day

[Figure: CDF of job running time (s), log scale from 1 to 100,000s; median job length: 84s]
Approaches to Sharing
• Hadoop default scheduler (FIFO)
– Problem: short jobs get stuck behind long ones
• Separate clusters
– Problem 1: poor utilization
– Problem 2: costly data replication
• Full replication nearly infeasible at FB / Yahoo! scale
• Partial replication prevents cross-dataset queries
Our Work: Hadoop Fair Scheduler
• Fine-grained sharing at the level of map & reduce tasks
• Predictable response times and user isolation
• Example with 3 job pools:

[Figure: cluster slots divided among 3 job pools; a sketch of the underlying max-min allocation follows below]
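Fair sharing across pools is typically computed with a max-min ("water-filling") allocation. The sketch below is illustrative only, with made-up pool names and demands; it is not the Fair Scheduler's actual code.

// Illustrative max-min fair allocation of task slots across pools.
// Not the actual Hadoop Fair Scheduler code; names and demands are hypothetical.
import java.util.*;

public class MaxMinFairShare {
    // Returns each pool's slot allocation given total slots and per-pool demand.
    static Map<String, Double> allocate(Map<String, Double> demand, double totalSlots) {
        Map<String, Double> share = new HashMap<>();
        Set<String> unsatisfied = new HashSet<>(demand.keySet());
        double remaining = totalSlots;
        while (!unsatisfied.isEmpty() && remaining > 1e-9) {
            double equal = remaining / unsatisfied.size();  // equal split of what's left
            boolean capped = false;
            for (String p : new ArrayList<>(unsatisfied)) {
                if (demand.get(p) - share.getOrDefault(p, 0.0) <= equal) {
                    // Pool needs less than an equal split: cap it at its demand.
                    remaining -= demand.get(p) - share.getOrDefault(p, 0.0);
                    share.put(p, demand.get(p));
                    unsatisfied.remove(p);
                    capped = true;
                }
            }
            if (!capped) {
                // Every remaining pool can absorb an equal split; give it out and stop.
                for (String p : unsatisfied) share.merge(p, equal, Double::sum);
                remaining = 0;
            }
        }
        return share;
    }

    public static void main(String[] args) {
        Map<String, Double> demand = Map.of("engineering", 40.0, "marketing", 10.0, "bi", 30.0);
        // Prints marketing=10.0, engineering=25.0, bi=25.0 (map order may vary).
        System.out.println(allocate(demand, 60));
    }
}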
What This Talk is About
• Two challenges for fair sharing in MapReduce
– Data locality (conflicts with fairness)
– Task interdependence (reduces can’t finish until maps are done, leading to slot hoarding)
• Some solutions
• Current work: Mesos project
Outline
• Challenge 1: data locality
– Solution: delay scheduling
• Challenge 2: reduce slot hoarding
– Solution: copy-compute splitting
• Current work: Mesos
The Problem
[Diagram: a master scheduling Job 1 ahead of Job 2 across six slaves; blocks of File 1 and File 2 are spread across the nodes, and tasks launch on the nodes holding their input blocks]
The Problem

[Diagram: after fair sharing reorders the queue, Job 2's tasks launch on whichever slaves free up first, often away from their input blocks]
Problem: Fair decision hurts locality
Especially bad for jobs with small input files
Data Locality in Production at Facebook

[Figure: Locality vs. Job Size at Facebook: percent of node-local and rack-local map tasks vs. job size in input blocks, log scale; the smallest jobs, 58% of all jobs, get the worst locality]
Special Instance: Sticky Slots
• Under fair sharing, locality can be poor even when all jobs have large input files
• Problem: jobs get “stuck” in the same set of task slots
– When a task in job j finishes, the slot it was running in is given back to j, because j is below its share
– Bad because data files are spread out across all nodes
Before a task finishes:

Job     Fair Share    Running Tasks
Job 1   2             2
Job 2   2             2

After one of Job 1's tasks finishes:

Job     Fair Share    Running Tasks
Job 1   2             1
Job 2   2             2

[Diagram: master with four slaves, each running one task; because Job 1 is now below its share, the freed slot is handed straight back to Job 1, on whatever node it happens to be]
Special Instance: Sticky Slots
[Figure: Data Locality vs. Number of Concurrent Jobs: percent of local map tasks falls sharply as concurrent jobs grow from 5 to 50]
Solution: Delay Scheduling
• Relax the queuing policy to make jobs wait a limited time if they cannot launch local tasks
• Result: a very short wait (1-5s) is enough to get nearly 100% locality
Delay Scheduling Example
[Diagram: with delay scheduling, Job 2 waits briefly instead of launching non-local tasks; moments later, slots free up on the slaves holding its input blocks and its tasks launch locally]

Idea: Wait a short time to get data-local scheduling opportunities
Delay Scheduling Details
• Scan jobs in the order given by the queuing policy, picking the first that is permitted to launch a task
• Jobs must wait before being permitted to launch non-local tasks:
– If wait < T1, only allow node-local tasks
– If T1 < wait < T2, also allow rack-local
– If wait > T2, also allow off-rack
• Increase a job’s time waited when it is skipped
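A minimal sketch of this loop in Java, assuming hypothetical Job/Node/Task types and millisecond thresholds T1/T2; the real Hadoop Fair Scheduler code differs in detail:

import java.util.List;

// Sketch of the delay scheduling loop (hypothetical types; the shipped
// Hadoop Fair Scheduler implementation differs in detail).
public class DelaySchedulingSketch {
    enum Level { NODE_LOCAL, RACK_LOCAL, OFF_RACK }

    interface Node {}
    interface Task {}
    interface Job {
        long timeWaitedMs();                       // time accumulated while skipped
        void markSkipped();                        // called when the job is passed over
        void resetWait();                          // called when a task launches
        Task pendingTaskAtMost(Level l, Node n);   // best pending task launchable on n
    }

    // Called whenever a slot frees up on freeNode; jobs arrive sorted by the
    // queuing policy (e.g. fair sharing). t1Ms/t2Ms are the locality thresholds.
    static Task assignSlot(Node freeNode, List<Job> jobsInPolicyOrder, long t1Ms, long t2Ms) {
        for (Job job : jobsInPolicyOrder) {
            long waited = job.timeWaitedMs();
            Level allowed = waited >= t2Ms ? Level.OFF_RACK       // waited past T2: anywhere
                          : waited >= t1Ms ? Level.RACK_LOCAL     // past T1: same rack is ok
                          : Level.NODE_LOCAL;                     // otherwise node-local only
            Task t = job.pendingTaskAtMost(allowed, freeNode);
            if (t != null) {
                job.resetWait();     // launched a task: stop the clock
                return t;
            }
            job.markSkipped();       // skipped: increase the job's time waited
        }
        return null;                 // no job may use this slot yet; leave it idle
    }
}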
Analysis
• When is it worth it to wait, and for how long?
Waiting is worth it if E(wait time) < cost to run non-locally

• Simple model: data-local scheduling opportunities arrive following a Poisson process
• Expected gain = P(local opportunity arrives within the wait) × net benefit:

E(gain) = (1 − e^(−w/t)) (d − t)

where w = wait amount, t = expected time to get a local scheduling opportunity, d = delay from running non-locally

• Under this model the optimal wait time is infinity, since the gain only grows with w
Analysis Continued

E(gain) = (1 − e^(−w/t)) (d − t)

• Typical values of t and d:
t = (avg task length) / (file replication × tasks per node)
d ≥ (avg task length)
• With replication 3 and 6 task slots per node, t is about 1/18 of the average task length, so t ≪ d
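Plugging in illustrative numbers (assumed here, not taken from the talk): with 60-second tasks, 3x replication, and 6 task slots per node, t ≈ 3.3s, so even a 5-second wait already makes a local launch likely:

// Worked example of E(gain) = (1 - e^(-w/t)) * (d - t) with assumed numbers
// (60s tasks, 3x replication, 6 slots/node; not figures from the talk).
public class DelayGain {
    public static void main(String[] args) {
        double avgTask = 60.0;                    // seconds, assumed
        double t = avgTask / (3 * 6);             // expected wait for a local opportunity
        double d = avgTask;                       // lower bound on non-local delay
        for (double w : new double[] {1, 5, 10}) {
            double pLocal = 1 - Math.exp(-w / t); // chance a local slot frees up within w
            double gain = pLocal * (d - t);       // expected response-time gain
            System.out.printf("w=%4.1fs  P(local)=%.2f  E(gain)=%5.1fs%n", w, pLocal, gain);
        }
        // w=5s already gives P(local) ~ 0.78, consistent with the 1-5s waits
        // that achieve near-100% locality in practice (repeated waits compound).
    }
}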
More Analysis
• What if some nodes are occupied by long tasks?
• Suppose we have:
– A fraction φ of active tasks are long
– R replicas of each block
– L task slots per node
• For a given data block, P(all replicas blocked by long tasks) = φ^(RL)
• With R = 3 and L = 6, this is less than 2% as long as φ < 0.8 (0.8^18 ≈ 0.018)
Evaluation
• 100-node EC2 cluster, 4 cores/node
• Job submission schedule based on job sizes and inter-arrival times at Facebook
– 100 jobs grouped into 9 “bins” of sizes
• Three workloads:
– IO-heavy, CPU-heavy, mixed
• Three schedulers:
– FIFO
– Fair sharing
– Fair + delay scheduling (wait time = 5s)
Results for IO-Heavy Workload

[Figure: job response times for small jobs (1-10 input blocks), medium jobs (50-800 input blocks), and large jobs (4800 input blocks) under the three schedulers; annotated gains: up to 5x, and 40%]
Results for IO-Heavy Workload

[Figure: left: percent of local maps per bin (1-9) for FIFO, Fair, and Fair + Delay Scheduling; right: speedup from delay scheduling per bin, ranging from 0% to 50%]
Sticky Slots Microbenchmark
• 5-50 big concurrent jobs on 100-node EC2 cluster
• 5s delay scheduling

[Figure: left: percent of local maps with and without delay scheduling for 5, 10, 20, and 50 concurrent jobs; right: benchmark running time in minutes, up to 2x faster with delay scheduling]
Summary
• Delay scheduling works well under two conditions:
– A sufficient fraction of tasks are short relative to jobs
– There are many locations where a task can run efficiently
• Blocks replicated across nodes, multiple tasks/node
• Generalizes beyond fair sharing & MapReduce
– Applies to any queuing policy that gives an ordered job list
• Used in the Hadoop Fair Scheduler to implement a hierarchical policy
– Applies to placement needs other than data locality
Outline
• Challenge 1: data locality
– Solution: delay scheduling
• Challenge 2: reduce slot hoarding
– Solution: copy-compute splitting
• Current work: Mesos
The Problem

[Animated diagram of map and reduce slot usage over time: Jobs 1 and 2 share the cluster; Job 2's maps finish, then its reduces complete, while Job 1's long reduces keep running. Job 3 is submitted and its maps finish, yet every reduce slot is still held by Job 1.]

Problem: Job 3 can’t launch reduces until Job 1 finishes
Solutions to Reduce Slot Hoarding
• Task killing (preemption):
– If a job’s fair share is not met after a timeout, kill tasks from jobs over their share (see the sketch below)
– Pros: simple, predictable
– Cons: wasted work
– Available in Hadoop 0.21 and CDH 3
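A rough sketch of the preemption check, with hypothetical types throughout; the shipped Hadoop 0.21 / CDH 3 code is more involved:

import java.util.*;

// Sketch of fair-share preemption (hypothetical types; illustrative only).
public class PreemptionSketch {
    interface Pool {
        double fairShare();          // slots this pool is entitled to
        double runningTasks();       // slots it currently holds
        long starvedSinceMs();       // when it first dropped below its share
        List<TaskRef> tasksNewestFirst();
    }
    interface TaskRef { void kill(); }

    static void preemptIfStarved(List<Pool> pools, long timeoutMs, long nowMs) {
        for (Pool starved : pools) {
            boolean belowShare = starved.runningTasks() < starved.fairShare();
            if (!belowShare || nowMs - starved.starvedSinceMs() < timeoutMs) continue;
            double need = starved.fairShare() - starved.runningTasks();
            for (Pool victim : pools) {
                // Only take slots from pools running above their fair share.
                double surplus = victim.runningTasks() - victim.fairShare();
                for (TaskRef t : victim.tasksNewestFirst()) {
                    if (need <= 0 || surplus <= 0) break;
                    t.kill();        // killed tasks are re-run later: wasted work
                    need--; surplus--;
                }
            }
        }
    }
}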
Solutions to Reduce Slot Hoarding
• Copy-compute splitting:
– Divide reduce tasks into two types of tasks: “copy tasks” and “compute tasks”
– Separate admission control for these two (sketched below)
• Larger number of copy slots per node, with a per-job cap
• Smaller number of compute slots that tasks “graduate” into
– Pros: no wasted work, overlaps IO & CPU more
– Cons: extra memory, may cause IO contention
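Since copy-compute splitting was never shipped, every name in this sketch is hypothetical; it only illustrates the two-level admission control described above:

// Sketch of copy-compute admission control on one node (hypothetical
// names throughout; this mechanism was never implemented in Hadoop).
public class CopyComputeSlots {
    private final int copySlots;       // larger pool: copy tasks mostly do network IO
    private final int computeSlots;    // smaller pool: compute tasks use CPU
    private final int perJobCopyCap;   // cap so one job can't hoard copy slots
    private int copyUsed = 0, computeUsed = 0;

    CopyComputeSlots(int copySlots, int computeSlots, int perJobCopyCap) {
        this.copySlots = copySlots;
        this.computeSlots = computeSlots;
        this.perJobCopyCap = perJobCopyCap;
    }

    // A reduce starts as a copy task if the job is under its per-job cap.
    synchronized boolean admitCopy(int jobCopiesOnNode) {
        if (copyUsed < copySlots && jobCopiesOnNode < perJobCopyCap) {
            copyUsed++;
            return true;
        }
        return false;
    }

    // When its shuffle finishes, a copy task "graduates" into a compute slot.
    synchronized boolean graduateToCompute() {
        if (computeUsed < computeSlots) {
            copyUsed--;       // frees a copy slot for another job's shuffle
            computeUsed++;
            return true;
        }
        return false;         // wait: sorting/reducing is deferred, not lost
    }

    synchronized void finishCompute() { computeUsed--; }
}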
Copy-Comp. Splitting in Single Job

[Animated diagram of one job's maps and reduces over time: each reduce is split into a copy phase (only IO, overlapping the maps) and a compute phase (only computation, after the maps are done), instead of one monolithic reduce task holding its slot throughout]
Copy-Comp. Splitting Evaluation
• Initial experiments showed a 10% gain for a single job and 1.5-2x response time gains over naïve fair sharing in a multi-job scenario
• Copy-compute splitting is not yet implemented anywhere (killing was good enough for now)
Outline
• Challenge 1: data locality
– Solution: delay scheduling
• Challenge 2: reduce slot hoarding
– Solution: copy-compute splitting
• Current work: Mesos
Challenges in Hadoop Cluster Management
• Isolation (both fault and resource isolation)
– E.g. if co-locating production and ad-hoc jobs
• Testing and deploying upgrades
• JobTracker scalability (in larger clusters)
• Running non-Hadoop jobs
– MPI, streaming MapReduce, etc.
Mesos
Mesos is a cluster resource manager over which multiple instances of Hadoop and other parallel applications can run
[Diagram: Mesos layered over a pool of nodes, running a production Hadoop instance, an ad-hoc Hadoop instance, and MPI side by side]
What’s Different About It?
Other resource managers exist today, including:
– Batch schedulers (e.g. Torque, Sun Grid Engine)
– VM schedulers (e.g. Eucalyptus)

However, they have two problems with Hadoop-like workloads:
– Data locality is hurt due to static partitioning
– Utilization is compromised because jobs hold nodes for their full duration

Mesos addresses these through two features:
– Fine-grained sharing at the level of tasks
– Two-level scheduling, where jobs control placement (sketched below)
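One way to picture two-level scheduling is as resource offers: the master offers free resources to frameworks, and each framework's scheduler decides which offers to accept and where its tasks run. The sketch below uses a hypothetical simplified API, not the real Mesos interface:

import java.util.*;

// Two-level scheduling sketch: the master offers resources; each framework
// decides placement. Hypothetical simplified API, not the real Mesos one.
public class TwoLevelSketch {
    record Offer(String node, int cpus, int memMb) {}
    record TaskLaunch(String node, String taskId) {}

    interface FrameworkScheduler {
        // The framework accepts a subset of offers (e.g. nodes with local data)
        // and declines the rest, keeping control over placement.
        List<TaskLaunch> resourceOffers(List<Offer> offers);
    }

    // Master side: hand free resources to one framework at a time.
    static void offerRound(List<Offer> free, List<FrameworkScheduler> frameworks) {
        for (FrameworkScheduler fw : frameworks) {
            List<TaskLaunch> launches = fw.resourceOffers(free);
            System.out.println("launching: " + launches);
            // A real master would remove accepted resources from `free`
            // before offering the remainder to the next framework.
        }
    }

    public static void main(String[] args) {
        FrameworkScheduler hadoop = offers -> offers.stream()
            .filter(o -> o.cpus() >= 2)                       // only accept offers with enough CPU
            .map(o -> new TaskLaunch(o.node(), "map-task"))
            .toList();
        offerRound(List.of(new Offer("node1", 4, 8192), new Offer("node2", 1, 2048)),
                   List.of(hadoop));
    }
}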
Mesos Architecture

[Diagram: a Mesos master coordinating Mesos slaves. Framework schedulers (Hadoop 0.20, Hadoop 0.19, MPI) register with the master; each slave runs the matching executors (Hadoop 20, Hadoop 19, MPI) and their tasks, and a single slave can host executors from several frameworks at once.]
Mesos Status
• Implemented in 10,000 lines of C++
– Efficient event-driven messaging lets the master scale to 10,000 scheduling decisions / second
– Fault tolerance using ZooKeeper
– Supports Hadoop, MPI & Torque
• Test deployments at Twitter and Facebook
• Open source, at www.github.com/mesos