wrangler - stanford universityneerajay/26-yadwadkar-slides.pdfneeraja j. yadwadkar, ! ganesh...

35
Wrangler Predictable and Faster Jobs using Fewer Resources Neeraja J.Yadwadkar, Ganesh Ananthanarayanan and Randy Katz 1

Upload: others

Post on 27-Oct-2020

19 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Wrangler  Predictable and Faster Jobs using

Fewer Resources!

Neeraja J. Yadwadkar, ���Ganesh Ananthanarayanan and Randy Katz

1

Page 2: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Parallel Data Analytics!Job completed

Job queue

Master

Slaves 2

Page 3: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Stragglers!Job completed

Job queue

Master

Slaves 3

Page 4: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Defining Stragglers!

Task Execution Time

Tasks of a job

4

Page 5: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Impact of Stragglers!Presence of Stragglers in real-world production level traces*:

Workload Stragglers

Facebook 2009 (FB2009)

Facebook 2010 (FB2010)

Cloudera’s Customer b (CC_b)

Cloudera’s Customer e (CC_e)

24 %

23 %

28 %

22 %

When replayed using SWIM+ on a 50 node EC2 cluster….

*captured for over 6 months from about 4000 machines in total

Threshold = 1.3 * median

+Chen Y., et al., The Case for Evaluating MapReduce Performance Using Workload Suites, MASCOTS’11 5

Page 6: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Impact of Stragglers!

Dolly, NSDI’13 6

Page 7: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Speculative Execution!

Job queue

Replicating

TS:  In  progress  

Job completed

TS !

TS !

TS !

7

Page 8: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Existing Approaches

8

Wait

Replicate

Wasted Time in detecting stragglers

Wasted Resources LATE

OSDI’08

Speculative Execution

OSDI’04

Mantri OSDI’10 Dolly

NSDI’13

Wrangler

Page 9: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Design Goals!

1.  Identify stragglers as early as possible

2.  Schedule tasks for improved job finishing times

1.  To avoiding creation of stragglers 2.  To avoid replication

9

Avoid Wasting

Resources

Avoid Wasting Time in detecting stragglers

Page 10: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Defer potential stragglers!

Job Submitted Map finished Map finished Job finished Job finished

Net Gain

Map1

Map2

Map3

Reduce Deferred

Assignments

Time

10

Page 11: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Model Builder

Predictive Scheduler

Worker

Master

Worker 1

Workers

Our proposal: Wrangler

“Input” Features

Scheduling Decisions

11

Page 12: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Selecting “Input” Features!

MapReduce: Dean, et. al, OSDI’04

LATE: Zaharia, et. al, OSDI’08

Mantri: Ananthanarayanan, et. al, OSDI’10

CPU -  cpu_idle -  cpu_system -  cpu_user -  …

Network -  bytes_in -  bytes_out -  …

Memory -  mem_buffers -  mem_cached -  mem_free -  mem_shared -  mem_total -  …

Disk -  swap_free -  swap_total -  disk_free -  disk_total -  …

Other -  Jvm. memHeapUsedM -  Jvm. threadsBlocked -  Jvm. gcTimeMillis -  … -  datanode.blocks_written -  … -  rpc.callQueueLen -  rpc.OpenConnections -  …

12

Page 13: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Selecting “Input” Features!

Key observation using Subset selection methods

Features of importance vary across nodes and across time!

•  Complex task-to-node interactions •  Complex task-to-task interactions •  Heterogeneous clusters •  Heterogeneous task requirements

Why?  13

Page 14: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Model Builder

Predictive Scheduler

Worker

Master

Worker 1

Workers

Our proposal: Wrangler

“Input” Features

Scheduling Decisions

14

Page 15: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Model Builder

Predictive Scheduler

Worker

Master

Worker 1

Workers

Our proposal: Wrangler

Scheduling Decisions

Utilization Counters

15

Page 16: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Model Builder

Predictive Scheduler

Worker

Master

Worker 1

Workers

Our proposal: Wrangler

Scheduling Decisions

Utilization Counters

16

Page 17: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Unknown and dynamic causes behind stragglers

A Challenge in Model Building !

17

Desired: An ability of predicting stragglers without needing domain-specific knowledge

Approach: Classification! Techniques that learn to adapt to changing

causes automatically

Page 18: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Classification for Predicting Stragglers!

18

{Utilization Counters} Straggler Non-Straggler (~100)

Predict:

Build a model:

{Utilization Counters, Straggler/Non-Straggler} Learning Classifier

Classifier

For every node in a cluster,

Page 19: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Classification Technique: SVM

19

Non-Stragglers

Stragglers

feature1

feat

ure 2

Separating Hyper-plane

Page 20: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Classification Accuracy using SVM!

20

0  

20  

40  

60  

80  

100  

120  

FB2009   FB2010   CC_b   CC_e  

Pre

dict

ion

Acc

urac

y (%

) Stragglers (True Positive) Non-Stragglers (True Negative)

Is this accuracy good enough?

Page 21: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Worker

Master

Worker 1

Workers

Our proposal: Wrangler

Utilization Counters

Scheduling Decisions

Model Builder

Predictive Scheduler

21

Page 22: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Predictive Scheduler (Naïve)!

22

Defer scheduling a task on a node that is predicted to create a

straggler

Page 23: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Intuition…. !

Time

Resource Usage

Key: Better load-balancing

1. Improved job completion

2. Reduced resource consumption

23

Page 24: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Does the Naïve approach improve job completion?!

!4.22% !5.53% !5.39%

11.34%

!5.51%

43.59%

58.87% 62.10%

43.13%

22.46%

!10%0%10%20%30%40%50%60%70%80%90%

100%

avg% 95p% 97p% 99p% 99.9p%

Percen

tage%Red

uc;o

n%

Predic;on%without%confidence%measure%Predic;on%with%confidence%measure%(p=0.8)%

Baseline: Speculative Execution

Workload: CC_b

24

Page 25: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Utilization Counters

25

Model Builder

Predictive Scheduler

Worker  

Master

Worker 1

Workers

Our proposal: Wrangler

Scheduling Decisions

Confident? Yes  

Page 26: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Utilization Counters

26

Model Builder

Predictive Scheduler

Worker  

Master

Worker 1

Workers

Our proposal: Wrangler

Scheduling Decisions

Confident? Yes  

Only confident predictions influence scheduling decisions

Page 27: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Confidence Measure

Non-Stragglers

Stragglers

feature1

feat

ure 2

Separating Hyper-plane

Corresponds to P: confidence

threshold 27

Low-confidence zone

Page 28: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

p: Confidence Threshold!

•  Value calculated via cross-validation

•  Verified via empirical sensitivity analysis

(refer to paper for details)

28

Page 29: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Evaluation!

•  Aim I: Does Wrangler Improve Job Completion Times?

•  Aim II: Does Wrangler Reduce Resources Consumed?

29

Page 30: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Aim 1: Does Wrangler Improve Job Completion Times?!

!4.22% !5.53% !5.39%

11.34%

!5.51%

43.59%

58.87% 62.10%

43.13%

22.46%

!10%0%10%20%30%40%50%60%70%80%90%

100%

avg% 95p% 97p% 99p% 99.9p%

Percen

tage%Red

uc;o

n%

Predic;on%without%confidence%measure%Predic;on%with%confidence%measure%(p=0.8)%

Workload: CC_b Baseline: Speculative Execution

Confidence measure is the key! 30

Page 31: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Aim II: Does Wrangler Reduce Resources Consumed?!

55.09

24.77

40.15

8.24

0 10 20 30 40 50 60 70 80 90

100

FB2009 FB2010 CC_b CC_e

Percen

tage  Red

uc;o

n  in  

Total  Task-­‐Second

s  

Baseline: Speculative Execution 31

Page 32: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Load-Balancing with Wrangler!

32

Few highly loaded nodes

Workload: FB2010

Page 33: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Sophisticated schedulers exist….why Wrangler?!

•  Difficult  to    – An9cipate  the  dynamically  changing  causes  – Deal  with  over  100  resource  u9liza9on  counters  – Build  a  generic  and  unbiased  scheduler  

 Wrangler statistically learns to

predict stragglers without needing domain-specific knowledge

33

Page 34: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Status and Future Directions!

•  Status: –  Improving accuracy using lesser training data –  Improving job completions further

•  Future Directions: – Generalization to other problems in cluster

computing environments •  Fault Detection •  Resource Allocation

34

Page 35: Wrangler - Stanford Universityneerajay/26-yadwadkar-slides.pdfNeeraja J. Yadwadkar, ! Ganesh Ananthanarayanan and Randy Katz" 1" Parallel Data Analytics! Job completed" Job queue"

Conclusions!•  ML techniques enable early prediction of tasks

that are likely to straggle. •  Use of confidence measure makes straggler

predictions useful

•  Wrangler: – Enables faster jobs – Uses fewer resources

Contact: [email protected] 35