wrangler - stanford universityneerajay/26-yadwadkar-slides.pdfneeraja j. yadwadkar, ! ganesh...

Wrangler Predictable and Faster Jobs using

Fewer Resources!

Neeraja J. Yadwadkar, ��Ganesh Ananthanarayanan and Randy Katz

1

Parallel Data Analytics!Job completed

Job queue

Master

Slaves 2

Stragglers!Job completed

Job queue

Master

Slaves 3

Defining Stragglers!

Task Execution Time

Tasks of a job

4

Impact of Stragglers!Presence of Stragglers in real-world production level traces*:

Workload Stragglers

Facebook 2009 (FB2009)

Facebook 2010 (FB2010)

Cloudera’s Customer b (CC_b)

Cloudera’s Customer e (CC_e)

24 %

23 %

28 %

22 %

When replayed using SWIM+ on a 50 node EC2 cluster….

*captured for over 6 months from about 4000 machines in total

Threshold = 1.3 * median

+Chen Y., et al., The Case for Evaluating MapReduce Performance Using Workload Suites, MASCOTS’11 5

Impact of Stragglers!

Dolly, NSDI’13 6

Speculative Execution!

Job queue

Replicating

TS: In progress

Job completed

TS !

TS !

TS !

7

Existing Approaches

8

Wait

Replicate

Wasted Time in detecting stragglers

Wasted Resources LATE

OSDI’08

Speculative Execution

OSDI’04

Mantri OSDI’10 Dolly

NSDI’13

Wrangler

Design Goals!

1.  Identify stragglers as early as possible

2.  Schedule tasks for improved job finishing times

1.  To avoiding creation of stragglers 2.  To avoid replication

9

Avoid Wasting

Resources

Avoid Wasting Time in detecting stragglers

Defer potential stragglers!

Job Submitted Map finished Map finished Job finished Job finished

Net Gain

Map1

Map2

Map3

Reduce Deferred

Assignments

Time

10

Model Builder

Predictive Scheduler

Worker

Master

Worker 1

Workers

Our proposal: Wrangler

“Input” Features

Scheduling Decisions

11

Selecting “Input” Features!

MapReduce: Dean, et. al, OSDI’04

LATE: Zaharia, et. al, OSDI’08

Mantri: Ananthanarayanan, et. al, OSDI’10

CPU -  cpu_idle -  cpu_system -  cpu_user -  …

Network -  bytes_in -  bytes_out -  …

Memory -  mem_buffers -  mem_cached -  mem_free -  mem_shared -  mem_total -  …

Disk -  swap_free -  swap_total -  disk_free -  disk_total -  …

Other -  Jvm. memHeapUsedM -  Jvm. threadsBlocked -  Jvm. gcTimeMillis -  … -  datanode.blocks_written -  … -  rpc.callQueueLen -  rpc.OpenConnections -  …

12

Selecting “Input” Features!

Key observation using Subset selection methods

Features of importance vary across nodes and across time!

•  Complex task-to-node interactions •  Complex task-to-task interactions •  Heterogeneous clusters •  Heterogeneous task requirements

Why? 13

Model Builder


Worker

Master

Worker 1

Workers


“Input” Features


14

Model Builder


Worker

Master

Worker 1

Workers



Utilization Counters

15

Model Builder


Worker

Master

Worker 1

Workers




16

Unknown and dynamic causes behind stragglers

A Challenge in Model Building !

17

Desired: An ability of predicting stragglers without needing domain-specific knowledge

Approach: Classification! Techniques that learn to adapt to changing

causes automatically

Classification for Predicting Stragglers!

18

{Utilization Counters} Straggler Non-Straggler (~100)

Predict:

Build a model:

{Utilization Counters, Straggler/Non-Straggler} Learning Classifier

Classifier

For every node in a cluster,

Classification Technique: SVM

19

Non-Stragglers

Stragglers

feature1

feat

ure 2

Separating Hyper-plane

Classification Accuracy using SVM!

20

0

20

40

60

80

100

120

FB2009 FB2010 CC_b CC_e

Pre

dict

ion

Acc

urac

y (%

) Stragglers (True Positive) Non-Stragglers (True Negative)

Is this accuracy good enough?

Worker

Master

Worker 1

Workers




Model Builder


21

Predictive Scheduler (Naïve)!

22

Defer scheduling a task on a node that is predicted to create a

straggler

Intuition…. !

Time

Resource Usage

Key: Better load-balancing

1. Improved job completion

2. Reduced resource consumption

23

Does the Naïve approach improve job completion?!

!4.22% !5.53% !5.39%

11.34%

!5.51%

43.59%

58.87% 62.10%

43.13%

22.46%

!10%0%10%20%30%40%50%60%70%80%90%

100%

avg% 95p% 97p% 99p% 99.9p%

Percen

tage%Red

uc;o

n%

Predic;on%without%confidence%measure%Predic;on%with%confidence%measure%(p=0.8)%

Baseline: Speculative Execution

Workload: CC_b

24


25

Model Builder


Worker

Master

Worker 1

Workers



Confident? Yes


26

Model Builder


Worker

Master

Worker 1

Workers



Confident? Yes

Only confident predictions influence scheduling decisions

Confidence Measure

Non-Stragglers

Stragglers

feature1

feat

ure 2

Separating Hyper-plane

Corresponds to P: confidence

threshold 27

Low-confidence zone

p: Confidence Threshold!

•  Value calculated via cross-validation

•  Verified via empirical sensitivity analysis

(refer to paper for details)

28

Evaluation!

•  Aim I: Does Wrangler Improve Job Completion Times?

•  Aim II: Does Wrangler Reduce Resources Consumed?

29

Aim 1: Does Wrangler Improve Job Completion Times?!

!4.22% !5.53% !5.39%

11.34%

!5.51%

43.59%

58.87% 62.10%

43.13%

22.46%

!10%0%10%20%30%40%50%60%70%80%90%

100%

avg% 95p% 97p% 99p% 99.9p%

Percen

tage%Red

uc;o

n%

Predic;on%without%confidence%measure%Predic;on%with%confidence%measure%(p=0.8)%

Workload: CC_b Baseline: Speculative Execution

Confidence measure is the key! 30

Aim II: Does Wrangler Reduce Resources Consumed?!

55.09

24.77

40.15

8.24

0 10 20 30 40 50 60 70 80 90

100

FB2009 FB2010 CC_b CC_e

Percen

tage Red

uc;o

n in

Total Task-‐Second

s

Baseline: Speculative Execution 31

Load-Balancing with Wrangler!

32

Few highly loaded nodes

Workload: FB2010

Sophisticated schedulers exist….why Wrangler?!

•  Difficult to – An9cipate the dynamically changing causes – Deal with over 100 resource u9liza9on counters – Build a generic and unbiased scheduler

Wrangler statistically learns to

predict stragglers without needing domain-specific knowledge

33

Status and Future Directions!

•  Status: –  Improving accuracy using lesser training data –  Improving job completions further

•  Future Directions: – Generalization to other problems in cluster

computing environments •  Fault Detection •  Resource Allocation

34

Conclusions!•  ML techniques enable early prediction of tasks

that are likely to straggle. •  Use of confidence measure makes straggler

predictions useful

•  Wrangler: – Enables faster jobs – Uses fewer resources

Contact: [email protected] 35

wrangler - stanford universityneerajay/26-yadwadkar-slides.pdfneeraja j. yadwadkar, ! ganesh...

Documents