self healing in streaming systems - university of washington

36
Self Healing in Streaming Systems #UW Database Day Dec 2nd, 2016 Karthik Ramasamy Twi/er @karthikz

Upload: others

Post on 30-Apr-2022

0 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Self Healing in Streaming Systems - University of Washington

Self Healing in Streaming Systems #UW Database Day

Dec 2nd, 2016

Karthik  Ramasamy  Twi/er

@karthikz

Page 2: Self Healing in Streaming Systems - University of Washington

2

What is self healing?

A  self  healing  system  adapts  itself  as  their  environmental  condi8ons  change  and  con8nue  to  produce  

results

Page 3: Self Healing in Streaming Systems - University of Washington

3

Why?Va

lue&of&Data&to&Decision

/Making&

Time&

Preven

8ve/&

Pred

ic8ve&

Ac8onable&

Reac8ve&Historical&

Real%&Time&

Seconds& Minutes& Hours& Days&

Tradi8onal&“Batch”&&&&&&&&&&&&&&&Business&&Intelligence&

Informa9on&Half%Life&In&Decision%Making&

Months&

Time/cri8cal&Decisions&

[1]  Courtesy  Michael  Franklin,  BIRTE,  2015.  

Page 4: Self Healing in Streaming Systems - University of Washington

4

Why?

G

Impact of downtime popular event such as

Super Bowl Oscars, etc

Ü

Impact of not honoring an SLA leading to penalty payments

!

Engineers & SRE burnt out attending to

incidents

increased productivityloss of revenue sla violations quality of life

With reduced incidents, engineers can focus on

actual development

s

Page 5: Self Healing in Streaming Systems - University of Washington

5

Real-Time Budget

BATCH

high throughput

> 1 hour

monthly active users relevance for ads

adhoc queries

REAL TIME

low latency

< 1 ms

Financial Trading

ad impressions count hash tag trends

approximate

10 ms - 1 sec

Near Real Time

latency sensitive

< 500 ms

fanout Tweets search for Tweets

deterministic workflows

OLTP

Page 6: Self Healing in Streaming Systems - University of Washington

6

Streaming Variants

STREAM PROCESSING

Analyze  and  act  on  data  with  con8nuous  queries  

STREAM WORKFLOWSCon8nuous   process   of   events  for  a  determinis8c  workflow

H

C

STREAM ANALYTICS

Con8nuously  analyze  data  using  mathema8cal  &  sta8s8cal  techinques

H

Page 7: Self Healing in Streaming Systems - University of Washington

7

Heron Topology

%

%

%

%

%

Spout 1

Spout 2

Bolt 1

Bolt 2

Bolt 3

Bolt 4

Bolt 5

Page 8: Self Healing in Streaming Systems - University of Washington

8

Heron Topology - Physical Execution

%

%

%

%

%

Spout 1

Spout 2

Bolt 1

Bolt 2

Bolt 3

Bolt 4

Bolt 5

%%

%%

%%

%%

%%

Page 9: Self Healing in Streaming Systems - University of Washington

9

Heron

Topology 1

TopologySubmission

Scheduler

Topology 2

Topology N

Page 10: Self Healing in Streaming Systems - University of Washington

10

Heron Topology Components

TopologyMaster

ZKCluster

Stream Manager

I1 I2 I3 I4

Stream Manager

I1 I2 I3 I4

Logical Plan, Physical Plan and Execution State

Sync Physical Plan

DATA CONTAINER DATA CONTAINER

Metrics Manager

Metrics Manager

MASTER CONTAINER

Page 11: Self Healing in Streaming Systems - University of Washington

11

Heron @Twitter

>  500  Real  Time  Jobs

500  Billions  Events/Day  PROCESSED

25-­‐200  MS  

latency

Page 12: Self Healing in Streaming Systems - University of Washington

12

Common Operational Issues

01 02 03

Slow Hosts Network Issues Data Skew

/.

-

Page 13: Self Healing in Streaming Systems - University of Washington

13

Slow Hosts

Memory Parity Errors

Impeding Disk Failures

Lower GHZ

G

!

g

Page 14: Self Healing in Streaming Systems - University of Washington

14

Heron Backpressure

% %S1 B2 B3

%B4

Page 15: Self Healing in Streaming Systems - University of Washington

15

Stream Manager

S1 B2

B3

Stream Manager

Stream Manager

Stream Manager

Stream Manager

S1 B2

B3 B4

S1 B2

B3

S1 B2

B3 B4

B4

Page 16: Self Healing in Streaming Systems - University of Washington

S1 S1

S1S1S1 S1

S1S1

16

Spout Backpressure

B2

B3

Stream Manager

Stream Manager

Stream Manager

Stream Manager

B2

B3 B4

B2

B3

B2

B3 B4

B4

Can  we  do  beWer?

Page 17: Self Healing in Streaming Systems - University of Washington

17

Network

Network Slowness

Network PartitioningG

g

Page 18: Self Healing in Streaming Systems - University of Washington

18

Network Slowness

Delays processing Data is accumulating

Timeliness of results is affected

I

Page 19: Self Healing in Streaming Systems - University of Washington

19

Network Slowness

Can  we  do  beWer?

Page 20: Self Healing in Streaming Systems - University of Washington

20

Network Partitioning

Stream Manager

TopologyMaster

TopologyMasterScheduler

Stream Manager

Stream Manager Scheduler Stream

Manager

Page 21: Self Healing in Streaming Systems - University of Washington

21

Network Partitioning

New Master Container

Acquiring Mastership in ZooKeeper fails

Master Container Dies

G

!

g

TopologyMasterScheduler

Page 22: Self Healing in Streaming Systems - University of Washington

22

Network Partitioning

TMaster thinks data container failed

Waits for the scheduler to reschedule new data container

Never happens

G

!

g

Stream Manager

TopologyMaster

Page 23: Self Healing in Streaming Systems - University of Washington

23

Network Partitioning

Cannot exchange data

Data accumulates

Chaos ensues!

G

!

g

Stream Manager

Stream Manager

Can  we  do  beWer?

Page 24: Self Healing in Streaming Systems - University of Washington

24

Network Partitioning

New data container spawned

TMaster realizes two data containers report as the same

Does not accept the new one and eventually it dies

G

!

g

Scheduler Stream Manager

Page 25: Self Healing in Streaming Systems - University of Washington

25

Data Skew

Multiple Keys

Several  keys  map  into  single  instance  and  their  count  is  high

Single KeySingle   key   maps   into   a  instance  and  its  count  is  high

H

C

Page 26: Self Healing in Streaming Systems - University of Washington

26

Data Skew - Multiple Keys

%

%

%

%

%

Spout 1

Spout 2

Bolt 1

Bolt 2

Bolt 3

Bolt 4

Bolt 5

%%

%%

%%

%%

%%

%%

Page 27: Self Healing in Streaming Systems - University of Washington

27

Data Skew - Single Key

%

%

%

%

%

Spout 1

Spout 2

Bolt 1

Bolt 2

Bolt 3

Bolt 4

Bolt 5

%%

%%

%%

%%

%%

%%%

What  happens  if  the  skew  is  temporary?

Page 28: Self Healing in Streaming Systems - University of Washington

28

Rack Failures

Multiple Rack

Outage  of  several  racks  simultaneously

Single RackOutage  of  a  single  rack

H

C

Page 29: Self Healing in Streaming Systems - University of Washington

29

Rack Local vs Rack DiversityRack Local Rack Diversity

Advantage  of  reduced  network  latency  and  high  p2p  bandwidth

Higher  network  latency  and  shared  bandwidth

Page 30: Self Healing in Streaming Systems - University of Washington

30

Single Rack Failure

Rack failureIf  job  is  rack  local  en8re  job  is  down  and  needs  to  be  rescheduled

Impact on real timelinessDepends  on  the  applica8on

{!

Rack Local

Page 31: Self Healing in Streaming Systems - University of Washington

31

Single Rack Failure

Rack Diversity Rack failureIf  job  is  rack  diverse  impact  is  minimal

Impact on real timelinessIs  affected  less  -­‐  depending  on  how  many  containers  are  running

{!

Page 32: Self Healing in Streaming Systems - University of Washington

32

Conclusion

"

Self  healing  is  important  in  Streaming  systems

Key  opera8onal  issues  -­‐  Slow  hosts,  network  and  data  skew

These  requirements  make  the  streaming  systems  more  complexG

"

Page 33: Self Healing in Streaming Systems - University of Washington

33

Interested in Heron?

CONTRIBUTIONS ARE WELCOME! https://github.com/twitter/heron

http://heronstreaming.io

HERON IS OPEN SOURCED

FOLLOW US @HERONSTREAMING

Page 34: Self Healing in Streaming Systems - University of Washington

34

WHAT WHY WHERE WHEN WHO HOW

Any Questions ???

Page 35: Self Healing in Streaming Systems - University of Washington

35

@karthikz Thanks For Listening

Page 36: Self Healing in Streaming Systems - University of Washington

36

Detections and Resolutions

Affects  real-­‐8me    (depends  on  applica8on  tolerance)   Keeps  up  with  real-­‐8me

Full  manual  interven8on N/A

Proac8ve

Reac8ve

DetecDon

s

Lazy Instant

ResoluDons