Understanding the Effects and Implications of Compute Node Failures in Hadoop
Florin Dinu T. S. Eugene Ng
Computing in the Big Data Era
15PB – 20PB – 100PB – 120PB
• Big Data – Challenging for previous systems
• Big Data frameworks:
  – MapReduce @ Google
  – Dryad @ Microsoft
  – Hadoop @ Yahoo & Facebook
Hadoop Is Widely Used

• Image Processing
• Protein Sequencing
• Web Indexing
• Machine Learning
• Advertising Analytics
• Log Storage and Analysis
• … and many more
Building Around Hadoop

SIGMOD 2010
Building On Top Of
Building on core Hadoop functionality
The Danger of Compute-Node Failures
“In each cluster’s first year, it’s typical that 1,000 individual machine failures will occur; thousands of hard drive failures will occur.”
Jeff Dean – Google I/O 2008

Causes:
• large scale
• use of commodity components

“Average worker deaths per job: 5.0”
Jeff Dean – Keynote I, PACT 2006
The Danger of Compute-Node Failures
In the cloud, compute node failures are the norm, NOT the exception.
Amazon, SOSP 2009
Failures From Hadoop’s Point of View
Important to understand effect of compute-node failures on Hadoop
Situations indistinguishable from compute node failures:
• Switch failures
• Longer-term disconnectivity
• Unplanned reboots
• Maintenance work (upgrades)
• Quota limits
• Challenging environments:
  – Spot markets (price-driven availability)
  – Volunteer computing systems
  – Virtualized environments
The Problem

• Hadoop is widely used
• Compute node failures are common

Hadoop needs to be failure resilient in an efficient way:
• Minimize impact on job running times
• Minimize resources needed
Contribution
• First in-depth analysis of the impact of failures on Hadoop
– Uncover several inefficiencies
  • Potential for future work
– Immediate practical relevance
– Basis for realistic modeling of Hadoop
Quick Hadoop Background
Background – the Tasks
[Diagram: the Master runs the JobTracker and NameNode; each worker node runs a TaskTracker and a DataNode and hosts map (M) and reduce (R) tasks. TaskTrackers ask the master for work ("Give me work!", "More work?"); the example job runs 2 waves of maps and 2 waves of reducers.]
Background – Data Flow
[Diagram: map tasks read input from HDFS; their outputs are shuffled to reducer tasks, which write results back to HDFS.]
Background – Speculative Execution
0 ≤ Progress Score ≤ 1
Progress Rate = Progress Score / time (e.g., 0.05/sec)

Ideal case: similar progress rates
Background – Speculative Execution (SE)
Reality: varying progress rates

Goal of SE:
• Detect underperforming nodes
• Duplicate the computation

Reasons for underperforming tasks: node overload, network congestion, etc.

Underperforming tasks (outliers) in Hadoop: > 1 STD slower than the mean progress rate
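The outlier rule above can be sketched in a few lines (an illustrative reimplementation, not Hadoop's actual code; the task names and rates are made up):

```python
# Illustrative sketch of the outlier rule: a task is an outlier if its
# progress rate is more than one standard deviation below the mean
# progress rate of its peers.
from statistics import mean, stdev

def find_outliers(progress_rates):
    """Return ids of tasks whose progress rate is below avg - 1*STD."""
    rates = list(progress_rates.values())
    threshold = mean(rates) - stdev(rates)
    return [task for task, rate in progress_rates.items() if rate < threshold]

# Hypothetical progress rates (progress score per second):
rates = {"t1": 0.05, "t2": 0.049, "t3": 0.051, "t4": 0.01}
print(find_outliers(rates))  # ['t4'] -- the slow task falls below the threshold
```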
How does Hadoop detect failures?
Failures of the Distributed Processes

Timeouts, Heartbeats & Periodic Checks

[Diagram: worker processes (TaskTracker, DataNode) send heartbeats to the master processes (JobTracker, NameNode).]
Timeouts, Heartbeats & Periodic Checks
Conservative approach – last line of defense:
• A failure interrupts the heartbeat stream
• The master periodically checks for changes
• Failure is declared after a number of checks
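The heartbeat/timeout scheme above can be sketched as follows (the class name and timeout value are illustrative assumptions, not Hadoop's actual classes or settings):

```python
# Hedged sketch of heartbeat-based failure detection: the master
# records each worker's last heartbeat and, on a periodic check,
# declares failed any worker silent for longer than the timeout.
import time

class FailureDetector:
    def __init__(self, timeout=600.0):  # seconds of silence tolerated
        self.timeout = timeout
        self.last_heartbeat = {}

    def heartbeat(self, worker, now=None):
        """Record a heartbeat from a worker."""
        self.last_heartbeat[worker] = time.time() if now is None else now

    def check(self, now=None):
        """Periodic check: return workers whose heartbeat stream stopped."""
        now = time.time() if now is None else now
        return [w for w, t in self.last_heartbeat.items()
                if now - t > self.timeout]

det = FailureDetector()
det.heartbeat("node1", now=0.0)
det.heartbeat("node2", now=500.0)
print(det.check(now=700.0))  # ['node1'] -- silent for 700s > 600s timeout
```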
Failures of the Individual Tasks (Maps)

• Infer map failures from reducer notifications
• Conservative – does not react to temporary failures

[Diagram: a reducer requests data from map task M ("Give me data!"); when M does not answer after repeated attempts spaced Δt apart, the reducer notifies the master.]
Failures of the Individual Tasks (Reducers)

Notifications also help infer reducer failures:
• Does R complain too much? (ratio of failed to successful attempts)
• Has R stalled for too long? (no new successful attempts)

[Diagram: a reducer repeatedly fails to fetch map output ("M does not answer!") and reports this to the master.]
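The two checks above can be sketched as a single predicate (the thresholds and the function name are hypothetical; Hadoop's actual configuration values differ):

```python
# Illustrative predicate for the two reducer-failure checks:
# a reducer is declared failed if it reports too many failed fetch
# attempts relative to successful ones, or if it has gone too long
# without any new successful attempt.
def reducer_failed(failed, succeeded, secs_since_success,
                   max_fail_ratio=0.5, stall_limit=300.0):
    """Declare a reducer failed if it complains too much or stalls."""
    complains_too_much = failed > max_fail_ratio * max(succeeded, 1)
    stalled_too_long = secs_since_success > stall_limit
    return complains_too_much or stalled_too_long

print(reducer_failed(failed=3, succeeded=0, secs_since_success=10.0))   # True
print(reducer_failed(failed=1, succeeded=10, secs_since_success=20.0))  # False
```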
Do these mechanisms work well?
Methodology

• Focus on failures of distributed components (TaskTracker and DataNode)
• Inject these failures separately
• Single failures:
  – Enough to catch many shortcomings
  – Identified the mechanisms responsible
  – Relevant to multiple failures too
Mechanisms Under Task Tracker Failure?

LARGE, VARIABLE, UNPREDICTABLE job running times
Poor performance under failure

Experiment setup:
• OpenCirrus
• Sort 10GB
• 15 nodes
• 14 reducers
• Inject a failure at a random time
• 220s running time without failures

Findings are also relevant to larger jobs.
Clustering Results Based on Cause

• Few reducers impacted: notification mechanism ineffective; timeouts fire
• In 70% of cases the notification mechanism is ineffective
• In some cases the failure has no impact – not due to notifications
Clustering Results Based on Cause

• More reducers impacted: the notification mechanism detects the failure; timeouts do not fire
• The notification mechanism detects the failure only in:
  – A few cases
  – A specific moment in the job
Side Effects: Induced Reducer Death

Failures propagate to healthy tasks – unlucky reducers die early.

• Does R complain too much? (failed / total attempts, e.g. 3 out of 3 failed)

Negative effects:
• Time and resource waste for re-execution
• Job failure – a small number of runs fail completely
Side Effects: Induced Reducer Death

• Has R stalled for too long? (no new successful attempts)

All reducers may eventually die. The fundamental problem:
• Task failures are inferred from connection failures
• Connection failures have many possible causes
• Hadoop has no way to distinguish the cause (source? destination?)
More Reducers: 4/Node = 56 Total

[CDF of job running times.]

Job running times spread out even more: more reducers mean more chances for the explained effects to occur.
Effect of DataNode Failures
Timeouts When Writing Data
Write Timeout (WTO)
Timeouts When Writing Data
Connect Timeout (CTO)
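The distinction between the two timeouts can be sketched with plain sockets (a hedged illustration; the timeout values and the `write_block` function are not Hadoop's DataNode protocol):

```python
# A connect timeout (CTO) fires when the node never accepts the
# connection; a write timeout (WTO) fires when an established
# connection stops accepting data mid-transfer.
import socket

CONNECT_TIMEOUT = 20.0   # CTO: give up if the node does not accept
WRITE_TIMEOUT = 60.0     # WTO: give up if an open connection stalls

def write_block(host, port, data):
    try:
        sock = socket.create_connection((host, port), timeout=CONNECT_TIMEOUT)
    except OSError:
        return "CTO"          # unreachable, refused, or connect timed out
    try:
        sock.settimeout(WRITE_TIMEOUT)
        sock.sendall(data)
        return "ok"
    except socket.timeout:
        return "WTO"          # connection stalled mid-write
    finally:
        sock.close()

print(write_block("127.0.0.1", 1, b"hello"))  # a closed port fails fast: CTO
```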
Effect on Speculative Execution

Outliers in Hadoop: > 1 STD slower than the mean progress rate, i.e. tasks whose progress rate falls below AVG – 1·STD.

[Plot: low- and high-PR tasks against the AVG and AVG – 1·STD lines; a task with a very high PR raises both the AVG and the STD, lowering the AVG – 1·STD threshold.]
Delayed Speculative Execution

Timeline:
• ~50s: reducers wait for mappers
• ~100s: map outputs read
• ~150s: reducers write output

Outlier threshold: Avg(PR) – STD(PR)
Delayed Speculative Execution

Timeline (continued):
• ~200s: failure occurs; reducers hit write timeouts (WTO); R9 is speculatively executed
• >200s: the new R9 skews the statistics with its very high progress rate
• ~400s: R11 is finally speculatively executed, once the threshold is low enough
Delayed Speculative Execution

• Hadoop’s assumptions about progress rates are invalidated
• Statistics are skewed by the very fast speculated task
• Significant impact on job running time
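A small numeric sketch of this skew (all rates are invented for illustration): adding one very fast speculative task to the statistics lowers the avg − 1·STD threshold enough that a genuinely stuck reducer stops looking like an outlier:

```python
# A re-executed reducer that reads already-available map output
# progresses extremely fast, raising both the mean and the STD of
# progress rates, so the avg - 1*STD threshold drops and the stuck
# reducer no longer qualifies as an outlier.
from statistics import mean, stdev

def is_outlier(rate, all_rates):
    return rate < mean(all_rates) - stdev(all_rates)

healthy = [0.05, 0.05, 0.05, 0.05]
stuck = 0.001  # a reducer stuck in a write timeout

# Before the fast speculative copy joins: the stuck reducer is flagged.
print(is_outlier(stuck, healthy + [stuck]))        # True

# After a very fast copy (rate 0.5) joins: the threshold drops below
# zero and the stuck reducer is no longer an outlier, delaying its SE.
print(is_outlier(stuck, healthy + [stuck, 0.5]))   # False
```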
52 Reducers – 1 Wave

• Reducers stuck in a WTO → delayed speculative execution
• CTO after WTO → tasks reconnect to the failed DataNode
Delayed SE – A General Problem

• Failures and timeouts are not the only cause
• To suffer from delayed SE, a job needs:
  – Slow tasks that would benefit from SE
    • Shown here: tasks stuck in a WTO
    • Others: slow or heterogeneous nodes, slow transfers (heterogeneous networks)
  – Fast-advancing tasks
    • Shown here: varying data input availability
    • Others: varying task input size, varying network speed

Statistical SE algorithms need to be used carefully.
Conclusion - Inefficiencies Under Failures
• Task Tracker failures:
  – Large, variable and unpredictable job running times
  – Variable efficiency depending on the number of reducers
  – Failures propagate to healthy tasks
  – Success of TCP connections is not enough
• Data Node failures:
  – Delayed speculative execution
  – No sharing of potential failure information (details in the paper)
Ways Forward

• Provide dynamic information about the infrastructure to applications (at least in private DCs)
• Make speculative execution cause-aware:
  – Why is a task slow at runtime?
  – Move beyond purely statistical SE algorithms
  – Estimate task progress rates (using environment and data characteristics)
• Share some information between tasks:
  – In Hadoop, tasks rediscover failures individually
  – There is much work on SE decisions (when and where to SE)
  – These decisions can be invalidated by such runtime inefficiencies
Thank you
Backup slides
Large variability in job running times
Experiment results: job running times cluster into groups G1–G7.
Group G1 – Few Reducers Impacted

Slow recovery when few reducers are impacted:
• M1 was copied by all reducers before the failure
• After the failure, the re-executed R1_1 cannot access M1
• R1_1 needs to send 3 notifications (~1250s), while the Task Tracker is declared dead after only 600–800s
Group G2 – Timing of Failure

The timing of the failure relative to the Job Tracker's periodic checks impacts job running time.

[Timeline: in G1 and G2 the failure falls at different points relative to the 170s checks and the 600s expiry, producing a 200s difference in job end time between G1 and G2.]
Group G3 – Early Notifications

Early notifications increase job running time variability:
• G1 notifications are sent after 416s
• G3 early notifications cause map outputs to be declared lost

Causes:
• Code-level race conditions
• The timing of a reducer's shuffle attempts

[Diagram: two possible shuffle-attempt schedules for R2 fetching outputs M5 and M6; the point at which the failure interrupts the schedule determines when notifications are sent.]
Groups G4 & G5 – Many Reducers Impacted

Job running time under failure varies with the number of reducers impacted:
• G4: many reducers send notifications after 416s; the map output is declared lost before the Task Tracker is declared dead
• G5: same as G4, but early notifications are sent
Task Tracker Failures

• Few reducers impacted: not enough notifications; timeouts fire
• Many reducers impacted: enough notifications sent; timeouts do not fire

LARGE, VARIABLE, UNPREDICTABLE job running times
Efficiency varies with the number of affected reducers
Node Failures: No RST Packets

[CDF of job running times.]

No RST → no notifications → timeouts always fire
Not Sharing Failure Information

With a different SE algorithm (OSDI ’08):
• Tasks are speculatively executed even before the failure, so delayed SE is not the cause
• Both the initial task and its speculative copy connect to the failed node – no sharing of potential failure information
Delayed Speculative Execution

Outlier test for task t: avg(PR(all)) – std(PR(all)) > PR(t)

Statistics are skewed by very fast speculative tasks; Hadoop’s assumptions about progress rates are invalidated.

[Plot: progress of R9 and R11 against the outlier limit; both are stuck in WTOs.]
Delayed Speculative Execution

Timeline:
• ~50s: reducers wait for map outputs
• ~100s: reducers get map outputs
• ~200s: failure → reducers time out
• ~200s: R9 speculatively executed; its huge progress rate skews the statistics
• ~400s: R11 finally speculatively executed

Statistics are skewed by very fast speculative tasks; Hadoop’s assumptions about progress rates are invalidated.