Thanh Do*, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, and Haryadi S. Gunawi

TRANSCRIPT

Page 1

Thanh Do*, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, and Haryadi S. Gunawi

Page 2

q Growing complexity of …
§ Technology scaling
§ Manufacturing
§ Design logic
§ Usage
§ Operating environment

q … makes HW fail differently
§ Complete fail-stop
§ Fail partial
§ Corruption
§ Performance degradation?

Rich literature

Page 3

“… 1Gb NIC card on a machine that suddenly starts transmitting at 1 kbps, this slow machine caused a chain reaction upstream in such a way that the performance of entire workload for a 100 node cluster was crawling at a snail's pace, effectively making the system unavailable for all practical purposes.” – Borthakur of Facebook

Degraded NIC! (1000000x)

Cascading impact!

More stories in the paper

Page 4

q Does HW degrade? Yes
§ Limpware: hardware whose performance degrades significantly compared to its specification

q Is this a destructive failure mode? Yes
§ Cascading failures, no “fail in place”

q Yet there has been no systematic analysis of its impact

Page 5

q 56 experiments that benchmark 5 systems
§ Hadoop, HDFS, ZooKeeper, Cassandra, HBase
§ 22 protocols
§ 8 hours under normal scenarios
§ 207 hours under limpware scenarios

q Unearth many limpware-intolerant designs

Our finding: a single piece of limpware (e.g., a NIC) can cause severe impact on a whole cluster

Page 6

q Introduction

q System analysis

q Limplock

q Limpware-Tolerant Systems

q Conclusion

Page 7

q “The performance of a 100 node cluster was crawling at a snail's pace” – Facebook

q But, … why?

Page 8

q Goals
§ Measure system-level impacts
§ Find design flaws

q Methodology
§ Target cloud systems (e.g., HDFS, Hadoop, ZooKeeper)
§ Inject load + limpware
- E.g., slow a NIC to 1 Mbps, 0.1 Mbps, etc. (a sketch of such an injection follows below)
§ White-box analysis (internal probes)
- Find design flaws
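To make the limpware injection concrete, here is a minimal sketch of one way to throttle a NIC with Linux `tc`. This is an illustrative assumption, not necessarily the tool used in the study; the device name `eth0` and the rates are placeholders.

```python
import subprocess

def limp_nic(dev: str, rate: str) -> None:
    """Throttle outgoing traffic on `dev` to `rate` (e.g. "1mbit", "100kbit").

    Uses a token-bucket filter; requires root privileges.
    """
    # Remove any existing root qdisc first (ignore the error if none exists).
    subprocess.run(["tc", "qdisc", "del", "dev", dev, "root"], check=False)
    # Install a token-bucket filter that caps the device's transmit rate.
    subprocess.run(
        ["tc", "qdisc", "add", "dev", dev, "root", "tbf",
         "rate", rate, "burst", "32kbit", "latency", "400ms"],
        check=True,
    )

def heal_nic(dev: str) -> None:
    """Restore the device by removing the throttling qdisc."""
    subprocess.run(["tc", "qdisc", "del", "dev", dev, "root"], check=True)

if __name__ == "__main__":
    limp_nic("eth0", "1mbit")   # degrade the NIC to roughly 1 Mbps
    # ... run the workload and measure slowdowns ...
    heal_nic("eth0")
```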

Page 9

q Run a distributed protocol
§ E.g., a 3-node write in HDFS

q Measure slowdowns under:
§ No failure, a crash, and a degraded NIC

[Chart: execution slowdown (1x, 10x, 100x, 1000x slower) of the workload under a 10 Mbps, 1 Mbps, and 0.1 Mbps NIC]
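As a rough sketch of how such a slowdown can be measured, one can time the same replicated HDFS write under each scenario. This assumes a stock `hdfs dfs` client; the file name and path are hypothetical.

```python
import subprocess
import time

def time_hdfs_write(local_file: str, hdfs_path: str) -> float:
    """Time a single HDFS write (pipelined to 3 datanodes by default)."""
    subprocess.run(["hdfs", "dfs", "-rm", "-f", hdfs_path], check=False)
    start = time.monotonic()
    subprocess.run(["hdfs", "dfs", "-put", local_file, hdfs_path], check=True)
    return time.monotonic() - start

baseline = time_hdfs_write("bench-512MB.bin", "/bench/512MB.bin")
# ... degrade one datanode's NIC (see the tc sketch above), then repeat ...
degraded = time_hdfs_write("bench-512MB.bin", "/bench/512MB.bin")
print(f"slowdown: {degraded / baseline:.1f}x")
```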

Page 10

q Introduction

q System analysis
§ Hadoop case study

q Limplock

q Limpware-Tolerant Cloud Systems

q Conclusion

Page 11

q Is Hadoop tail-tolerant?
§ Why is speculative execution not triggered?

q Consider a degraded NIC on a map node
§ Task M2’s speed equals that of M1 and M3
§ Input data is local!

q But all reducers are slow
§ A straggler is a task that is slow relative to other tasks of the same job
§ All reducers are equally slow, so no straggler is detected!

q Flaws
§ Task-level straggler detection
§ Single point of failure!

[Chart: Wordcount on Hadoop; mappers M1–M3 feeding reducers, with a slowdown axis from 1x to 10x]
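A minimal sketch of why job-relative straggler detection misses this case (the threshold and the progress-rate representation are hypothetical, not Hadoop's actual scheduler code):

```python
def find_stragglers(progress_rates, slack=0.5):
    """Flag tasks whose progress rate is far below the job average.

    progress_rates maps task id -> fraction of work completed per second.
    A task is a straggler only relative to other tasks of the *same* job.
    """
    avg = sum(progress_rates.values()) / len(progress_rates)
    return [t for t, r in progress_rates.items() if r < slack * avg]

# One slow reducer among fast ones: detected, a speculative copy is launched.
print(find_stragglers({"r1": 1.0, "r2": 1.0, "r3": 0.01}))   # ['r3']

# Every reducer must fetch map output from the node with the limping NIC,
# so all of them are equally slow -- and none is ever flagged.
print(find_stragglers({"r1": 0.01, "r2": 0.01, "r3": 0.01}))  # []
```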

Page 12

q A degraded NIC → degraded tasks
§ (Degraded tasks are slower by orders of magnitude)

q Slow tasks use up slots → degraded node
§ Default: 2 mappers and 2 reducers per node
§ If all slots are used → node is “unavailable”

q All nodes in limp mode → degraded cluster

[Diagram: a node with a slow NIC next to a healthy node in limp mode, each with two map (M) and two reduce (R) slots]
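A toy simulation of this cascade, under the simplifying (and assumed) model that any task which must exchange data with the limping node slows down and holds its slot indefinitely:

```python
import random

NODES = 30
SLOTS_PER_NODE = 4            # default: 2 map slots + 2 reduce slots
LIMPING = 0                   # index of the node with the degraded NIC
P_TOUCH_LIMPING = 0.9         # chance a shuffle-heavy task talks to that node

busy_slow = [0] * NODES       # slots on each node held by limplocked tasks

def schedule_task() -> bool:
    """Place one task on a random node with a free slot.

    Tasks that touch the limping node never finish in this toy model,
    so they permanently consume the slot; fast tasks free it immediately.
    Returns False once no node has a free slot (cluster in limp mode).
    """
    free = [n for n in range(NODES) if busy_slow[n] < SLOTS_PER_NODE]
    if not free:
        return False
    node = random.choice(free)
    if node == LIMPING or random.random() < P_TOUCH_LIMPING:
        busy_slow[node] += 1
    return True

tasks = 0
while schedule_task():
    tasks += 1
print(f"every slot occupied by a limplocked task after ~{tasks} scheduled tasks")
```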

Page 13

[Embedded paper figures:

Figure 1: Limpbench Results. Each graph plots the slowdowns (log scale) of one experiment (e.g., F1) described in Table 1 under various limpware scenarios. In the first row a limping disk is injected; in the rest, a limping NIC is injected. The graphs show that cloud systems are crash tolerant, but not limpware tolerant.

Figure 2: Hadoop Limplock. The graphs show (a) the progress scores of limplocked reducers of a job in experiment H1 (a normal reducer is shown for comparison), (b) cascades of node limplock due to a single limpware, and (c) a throughput collapse of a Hadoop cluster due to a limpware. For (b) and (c), a Facebook workload [1] was run on a 30-node cluster.

Figure 3: HDFS Limplock Probabilities. The figures plot the probabilities of (a) read limplock (Prl), (b) write limplock (Pwl), (c) block limplock (Pbl), and (d) cluster regeneration limplock (Pcl), as defined in Table 2. The x-axis plots cluster size.]

q Macrobenchmark: Facebook workload
§ 30-node cluster
§ One node with a degraded NIC (0.1 Mbps)

Cluster collapse (throughput drops to about 1 job/hour)! Why?

Page 14

Fail-stop tolerant, but not limpware tolerant (no failover recovery)

Page 15

q Introduction

q System analysis

q Formalizing the problem: Limplock
§ Definitions and causes

q Limpware-Tolerant Systems

q Conclusion

Page 16

q Definition
§ The system progresses slowly due to limpware and is not capable of failing over to healthy components
§ (i.e., the system is “locked” in limping mode)

q 3 levels of limplock
§ Operation
§ Node
§ Cluster

Page 17

q Operation Limplock
§ An operation involving limpware is “locked” in limping mode; no failover

q Node Limplock
§ Operations that must be served by this node experience limplock, even though those operations do not themselves involve limpware

q Cluster Limplock
§ The whole cluster is in limplock due to limpware

Page 18

q Operation Limplock
§ Single point of failure
- Hadoop slow map task
- HBase “gateway”

[Diagram: mappers M1–M3 feeding reducers]

Page 19

q Operation Limplock
§ Single point of failure
§ Coarse-grained timeout
§ (more in the paper)

[Chart: slowdown (1x–100x) of a 512 MB write to HDFS]

Reason: no timeout is triggered. HDFS uses a coarse-grained timeout: a 60-second timeout on each 64 KB packet, so a transfer can limp at close to 1 KB/s without ever timing out.
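The arithmetic behind that claim, using the 64 KB and 60-second figures from the slide:

```python
packet_bytes = 64 * 1024      # data covered by one timeout window
timeout_s = 60                # seconds before that packet is declared failed
floor_rate = packet_bytes / timeout_s
print(f"{floor_rate / 1024:.2f} KB/s")   # ~1.07 KB/s: any NIC faster than this never trips the timeout
```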

Page 20

q Operation Limplock
§ Single point of failure
§ Coarse-grained timeout
§ …

q Node Limplock
§ Bounded multi-purpose thread pool

[Diagram: the HDFS master serving in-memory metadata reads and metadata writes; chart shows in-memory reads becoming more than 100x slower than normal]

Reason: resource exhaustion by limplocked operations; in-memory metadata reads are blocked behind them.
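A minimal sketch of the bounded multi-purpose thread pool pattern (the pool size and operation names are illustrative, not HDFS's actual handler code): once every handler thread is stuck on a limplocked operation, even purely in-memory reads wait behind them.

```python
import time
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=4)   # one bounded pool shared by all RPC types

def meta_write_touching_limpware():
    time.sleep(60)                         # limplocked: crawls for minutes
    return "write done"

def in_memory_meta_read():
    return "read done"                     # normally takes microseconds

# Four limplocked writes occupy every handler thread...
writes = [pool.submit(meta_write_touching_limpware) for _ in range(4)]

# ...so this read, which needs no slow hardware at all, queues behind them.
t0 = time.monotonic()
read = pool.submit(in_memory_meta_read)
print(read.result(), f"after {time.monotonic() - t0:.0f}s")   # ~60s: node limplock
```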

Page 21

q Operation Limplock
§ Single point of failure
§ Coarse-grained timeout
§ …

q Node Limplock
§ Bounded multi-purpose thread pool
§ Bounded multi-purpose queue


Page 22

q Operation Limplock
§ Single point of failure
§ Coarse-grained timeout
§ …

q Node Limplock
§ Bounded multi-purpose thread pool
§ Bounded multi-purpose queue
§ Unbounded thread pool/queue
- Ex: backlogged queue at the leader
- Node limplock at the leader due to heavy garbage collection
- Quorum write: 10x slowdown

[Diagram/chart: a client issuing quorum writes through the ZooKeeper leader to its followers; slowdown (1x–10x) under a 20-second vs. a 600-second stress load]
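A sketch of the unbounded-queue failure mode (illustrative only; ZooKeeper's real per-follower sender queues and garbage-collection behavior are more involved): when one follower's NIC limps, the leader's outgoing queue for that follower grows without bound, and the resulting heap pressure and GC work slow every request.

```python
from collections import deque

class FollowerLink:
    """Leader-side view of one follower: an unbounded outgoing message queue."""
    def __init__(self, drain_per_tick: int):
        self.queue = deque()
        self.drain_per_tick = drain_per_tick   # proposals the follower absorbs per tick

    def send(self, msg):
        self.queue.append(msg)                 # never rejects or sheds load

    def tick(self):
        for _ in range(min(self.drain_per_tick, len(self.queue))):
            self.queue.popleft()

healthy = FollowerLink(drain_per_tick=1000)
limping = FollowerLink(drain_per_tick=1)       # follower behind a 0.1 Mbps NIC

for t in range(1000):                          # quorum writes keep arriving
    for link in (healthy, limping):
        for i in range(10):                    # 10 proposals per tick
            link.send(("proposal", t, i))
        link.tick()

print(len(healthy.queue), len(limping.queue))
# healthy: 0 queued; limping: ~9000 queued and growing -> heap pressure,
# garbage-collection stalls, and node limplock at the leader.
```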

Page 23

q Operation Limplock
§ Single point of failure
§ Coarse-grained timeout
§ …

q Node Limplock
§ Bounded multi-purpose thread pool
§ Bounded multi-purpose queue
§ Unbounded thread pool/queue

q Cluster Limplock
§ All nodes in limplock
- Ex: resource exhaustion in Hadoop, HDFS regeneration
§ Master limplock in a master-slave architecture
- Ex: cases in ZooKeeper, HDFS

Page 24

q Found 15 protocols that exhibit limplock
§ 8 in HDFS
§ 1 in Hadoop
§ 2 in ZooKeeper
§ 4 in HBase

Limplock happens in almost all systems we have analyzed

Page 25

q Introduction

q System analysis

q Limplock

q Limpware-Tolerant Cloud Systems

q Conclusion

Page 26

q Anticipation
§ Limpware-tolerant design patterns
§ Limpware static analysis
§ Limpware statistics
- Existing work: memory failure, disk failure, etc.

q Detection
§ Performance degradation → implicit (no hard errors)
§ Study explicit causes (e.g., block remapping, error correction)

q Recovery
§ How to “fail in place”?
§ Better to fail-stop than fail-slow?
§ Quarantine?

q Utilization
§ Fail-stop: fail or working
§ Limpware: degrades anywhere from 1-100%

Page 27

q New failure modes → transform systems

q Limpware is a “new”, destructive failure mode
§ Orders of magnitude slowdown
§ Cascading failures
§ No “fail in place” in current systems

A need for Limpware-Tolerant Systems

Page 28