TRANSCRIPT
Self Healing in Streaming Systems
UW Database Day, University of Washington
Dec 2nd, 2016
Karthik Ramasamy, Twitter
@karthikz
What is self healing?
A self-healing system adapts itself as its environmental conditions change and continues to produce results.
Why?
[Figure: Value of Data to Decision-Making — the information half-life in decision-making. Value is highest for time-critical decisions made in real time (seconds: predictive/preventive), then declines through actionable (minutes), reactive (hours), and historical (days to months) traditional "batch" business intelligence.]
[1] Courtesy Michael Franklin, BIRTE, 2015.
Why?
Loss of revenue: impact of downtime during popular events such as the Super Bowl, the Oscars, etc.
SLA violations: impact of not honoring an SLA, leading to penalty payments.
Quality of life: engineers and SREs burn out attending to incidents.
Increased productivity: with fewer incidents, engineers can focus on actual development.
Real-Time Budget

BATCH: high throughput, > 1 hour
  e.g., ad-hoc queries, monthly active users, relevance for ads

REAL TIME (OLTP, deterministic workflows): low latency, < 1 ms
  e.g., financial trading

NEAR REAL TIME (latency sensitive): < 500 ms
  e.g., fanout Tweets, search for Tweets

APPROXIMATE: 10 ms - 1 sec
  e.g., ad impressions count, hash tag trends
Streaming Variants
STREAM PROCESSING: analyze and act on data with continuous queries.
STREAM WORKFLOWS: continuous processing of events for a deterministic workflow.
STREAM ANALYTICS: continuously analyze data using mathematical & statistical techniques.
Heron Topology
[Diagram: a topology as a directed acyclic graph — two spouts (Spout 1, Spout 2) emitting streams into a network of bolts (Bolt 1 through Bolt 5).]
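The topology above is a DAG of spouts (stream sources) and bolts (operators). A minimal Python model of that idea — the `Spout` and `Bolt` classes below are illustrative stand-ins, not the Heron API:

```python
# Minimal sketch of a topology: spouts emit tuples, bolts transform them.
# Illustrative only -- not the Heron/Storm API.

class Spout:
    def __init__(self, records):
        self.records = records

    def emit(self):
        yield from self.records

class Bolt:
    def __init__(self, fn):
        self.fn = fn

    def process(self, tup):
        return self.fn(tup)

def run(spouts, bolts):
    """Push every spout tuple through the chain of bolts in order."""
    out = []
    for spout in spouts:
        for tup in spout.emit():
            for bolt in bolts:
                tup = bolt.process(tup)
            out.append(tup)
    return out

# Two spouts feeding a small chain of bolts, as in the diagram.
spout1 = Spout(["heron", "storm"])
spout2 = Spout(["stream"])
upper = Bolt(str.upper)             # first bolt: normalize
tag = Bolt(lambda s: ("word", s))   # second bolt: wrap into a keyed tuple

print(run([spout1, spout2], [upper, tag]))
# -> [('word', 'HERON'), ('word', 'STORM'), ('word', 'STREAM')]
```

A real topology is a graph rather than a chain, and the runtime moves tuples between processes, but the data-flow contract is the same.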
Heron Topology - Physical Execution
[Diagram: the same topology with each spout and bolt expanded into multiple parallel physical instances of Spout 1, Spout 2, and Bolt 1 through Bolt 5.]
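In the physical execution above, each logical spout or bolt runs as several parallel instances, and a tuple's destination instance can be chosen by hashing its grouping key. A minimal sketch — the hash function and parallelism here are illustrative, not Heron's actual routing:

```python
# Sketch: a logical bolt with parallelism 3 becomes 3 physical instances.
# Tuples are routed to an instance by hashing the grouping key
# (an illustrative "fields grouping"; the toy hash is ours, not Heron's).

PARALLELISM = 3

def route(key, parallelism=PARALLELISM):
    """Pick the physical instance responsible for this key."""
    return sum(key.encode()) % parallelism   # stable toy hash

instances = {i: [] for i in range(PARALLELISM)}
for word in ["heron", "storm", "heron", "spark", "heron"]:
    instances[route(word)].append(word)

# The same key always lands on the same instance, so per-key
# state (e.g., counts) can live on that instance alone.
print({i: words for i, words in instances.items() if words})
```

Stable key-to-instance routing is what makes per-key aggregations possible — and, as the data-skew slides later show, it is also what makes hot keys dangerous.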
Heron
[Diagram: topology submission — Topology 1 through Topology N are submitted to a Scheduler, which launches each topology on the cluster.]
Heron Topology Components
[Diagram: a master container runs the Topology Master, which keeps the logical plan, physical plan, and execution state in a ZooKeeper cluster. Each data container runs a Stream Manager, a Metrics Manager, and instances I1-I4; the Stream Managers sync the physical plan with the Topology Master.]
Heron @Twitter
> 500 real-time jobs
500 billion events/day processed
25-200 ms latency
Common Operational Issues
01 Slow Hosts
02 Network Issues
03 Data Skew
Slow Hosts
Memory parity errors
Impending disk failures
Lower GHz
Heron Backpressure
[Diagram: a simple topology S1 -> B2 -> B3 -> B4 used in the backpressure examples that follow.]
Stream Manager Backpressure
[Diagram: several containers, each with a Stream Manager and instances of S1, B2, B3, and B4. When an instance of B4 falls behind, its Stream Manager signals backpressure to the other Stream Managers.]
Spout Backpressure
[Diagram: on a backpressure signal, each Stream Manager stops reading from the spouts (S1) in its own container until the slow instance catches up.]
Can we do better?
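The spout-backpressure mechanism just described can be modeled with a bounded buffer and high/low watermarks: when the queue in front of a slow bolt crosses the high watermark the spout is paused, and it resumes once the queue drains below the low watermark. The watermark values and rates below are illustrative assumptions, not Heron's defaults:

```python
# Toy spout-backpressure model: when the downstream buffer crosses a
# high watermark, the spout is paused; it resumes below a low watermark.
# Watermark values and rates are illustrative, not Heron's defaults.

HIGH_WATERMARK = 8
LOW_WATERMARK = 2

class Pipeline:
    def __init__(self, bolt_rate):
        self.buffer = []            # stream manager queue before the bolt
        self.bolt_rate = bolt_rate  # tuples the slow bolt drains per tick
        self.spout_paused = False
        self.processed = 0

    def tick(self, spout_rate):
        # Spout emits only while not under backpressure.
        if not self.spout_paused:
            self.buffer.extend(range(spout_rate))
        # The slow bolt drains a few tuples.
        for _ in range(min(self.bolt_rate, len(self.buffer))):
            self.buffer.pop(0)
            self.processed += 1
        # Backpressure state transitions.
        if len(self.buffer) >= HIGH_WATERMARK:
            self.spout_paused = True
        elif len(self.buffer) <= LOW_WATERMARK:
            self.spout_paused = False

p = Pipeline(bolt_rate=2)
peak = 0
for _ in range(50):
    p.tick(spout_rate=5)   # the spout is faster than the bolt
    peak = max(peak, len(p.buffer))

# Backpressure bounds the queue instead of letting it grow without limit.
print(peak, p.processed)
```

Without the pause, the buffer would grow by three tuples per tick forever; with it, the queue oscillates around the watermarks while the bolt keeps processing at its own rate.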
Network
Network slowness
Network partitioning
Network Slowness
Processing is delayed and data accumulates
Timeliness of results is affected
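The accumulation under network slowness follows a simple fluid model: if tuples arrive at rate λ and the slowed link drains at rate μ < λ, the backlog grows by (λ - μ) per unit time. A quick sketch — the rates are made-up numbers, not measurements:

```python
# Backlog growth when a slow network link drains slower than data arrives.
# Rates are illustrative numbers, not measurements from Heron.

def backlog(arrival_rate, drain_rate, seconds):
    """Queued tuples after `seconds`, assuming an empty starting queue."""
    return max(0, (arrival_rate - drain_rate) * seconds)

# A healthy link keeps up; a slowed link falls behind linearly.
print(backlog(arrival_rate=10_000, drain_rate=12_000, seconds=60))  # 0
print(backlog(arrival_rate=10_000, drain_rate=4_000, seconds=60))   # 360000
```

The linear growth is why timeliness degrades steadily during slowness, and why the backlog must be drained (or shed) before results catch up to real time.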
Network Slowness
Can we do better?
Network Partitioning
[Diagram: a partition can separate the Topology Master from the Scheduler, the Topology Master from Stream Managers, or Stream Managers from one another.]
Network Partitioning: Topology Master / Scheduler
Master container dies
New master container is started
Acquiring mastership in ZooKeeper fails
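The failed takeover above comes from mastership being an atomic "create if absent" record (a ZooKeeper ephemeral node in Heron's case). The sketch below uses an in-memory stand-in, not the ZooKeeper API, to show why the replacement master cannot win while the old holder's session is still alive:

```python
# Sketch of mastership election via an atomic "create if absent" record,
# standing in for a ZooKeeper ephemeral node. In-memory fake, not the ZK API.

class FakeCoordinator:
    def __init__(self):
        self._master = None

    def try_acquire(self, candidate):
        """Atomically become master only if no live holder exists."""
        if self._master is None:
            self._master = candidate
            return True
        return False   # the existing holder (possibly partitioned away) wins

    def release(self, holder):
        """Stands in for session expiry or an explicit release."""
        if self._master == holder:
            self._master = None

zk = FakeCoordinator()
assert zk.try_acquire("tmaster-old") is True
# A replacement spun up during a partition cannot take over while the
# old master's session is still alive -- the failure on this slide.
assert zk.try_acquire("tmaster-new") is False
# Once the old session expires or releases, the new container succeeds.
zk.release("tmaster-old")
assert zk.try_acquire("tmaster-new") is True
```

The upside of this behavior is split-brain prevention: there is never more than one acting master, even when the scheduler's view of liveness is wrong.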
Network Partitioning: Topology Master / Stream Manager
TMaster thinks the data container failed
Waits for the scheduler to reschedule a new data container
That never happens
Network Partitioning: Stream Manager / Stream Manager
Cannot exchange data
Data accumulates
Chaos ensues!
Can we do better?
Network Partitioning: Scheduler / Stream Manager
A new data container is spawned
TMaster realizes two data containers report as the same one
It does not accept the new one, which eventually dies
Data Skew
Multiple keys: several keys map into a single instance and their combined count is high
Single key: a single key maps into an instance and its count is high
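Both skew variants fall out of hash-based routing: the multiple-key case is several heavy keys colliding on one instance, the single-key case is one hot key whose entire volume lands on one instance. A sketch — the keys, volumes, and toy hash are made up for illustration:

```python
# Sketch: fields-grouping by hash can send several heavy keys to one
# instance (multiple-key skew) or one very hot key to one instance
# (single-key skew). Keys, volumes, and the hash are illustrative.

from collections import Counter

PARALLELISM = 4

def instance_for(key):
    return sum(key.encode()) % PARALLELISM   # stable toy hash

# Tuple volume per key: "oscars" and "superbowl" are hot -- and they
# happen to hash to the same instance.
key_volume = {"oscars": 90_000, "superbowl": 80_000, "news": 1_000, "dogs": 2_000}

load = Counter()
for key, volume in key_volume.items():
    load[instance_for(key)] += volume

print(dict(load))  # one instance carries almost all of the traffic
```

Because routing must stay stable for per-key state to work, the overloaded instance cannot simply hand keys to an idle peer — which is what makes skew an operational problem rather than a scheduling one.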
Data Skew - Multiple Keys
[Diagram: the physical plan with several heavy keys hashing to the same bolt instance, overloading it.]
Data Skew - Single Key
[Diagram: one hot key concentrating its entire load on a single bolt instance.]
What happens if the skew is temporary?
Rack Failures
Multiple racks: outage of several racks simultaneously
Single rack: outage of a single rack
Rack Local vs Rack Diversity
Rack local: advantage of reduced network latency and high p2p bandwidth
Rack diversity: higher network latency and shared bandwidth
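The placement trade-off above can be made concrete with a tiny availability calculation: a rack-local job loses 100% of its containers to one rack outage, while a job spread over R racks loses roughly 1/R. The container counts below are illustrative:

```python
# Fraction of a job's containers lost to a single rack outage, for
# rack-local vs rack-diverse placement. Numbers are illustrative.

def fraction_lost(placement, failed_rack):
    """placement: list of rack ids, one entry per container."""
    return placement.count(failed_rack) / len(placement)

rack_local = ["r1"] * 8                      # all 8 containers on one rack
rack_diverse = ["r1", "r2", "r3", "r4"] * 2  # spread over 4 racks

print(fraction_lost(rack_local, "r1"))    # 1.0 -> whole job rescheduled
print(fraction_lost(rack_diverse, "r1"))  # 0.25 -> impact is minimal
```

This is the quantitative version of the next two slides: rack locality buys latency and bandwidth at the cost of a single point of failure.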
Single Rack Failure - Rack Local
Rack failure: if the job is rack local, the entire job is down and needs to be rescheduled
Impact on real timeliness: depends on the application
Single Rack Failure - Rack Diversity
Rack failure: if the job is rack diverse, the impact is minimal
Impact on real timeliness: affected less, depending on how many containers are running
Conclusion
Self healing is important in streaming systems
Key operational issues: slow hosts, network, and data skew
These requirements make streaming systems more complex
Interested in Heron?
HERON IS OPEN SOURCED
CONTRIBUTIONS ARE WELCOME! https://github.com/twitter/heron
http://heronstreaming.io
FOLLOW US @HERONSTREAMING
Any Questions?
Thanks for listening! @karthikz
Detections and Resolutions

              Lazy resolution                           Instant resolution
Proactive     Affects real-time (depends on             Keeps up with real-time
detection     application tolerance)
Reactive      Full manual intervention                  N/A
detection
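The matrix above frames self healing as a detect-then-resolve loop: some check decides a component is unhealthy (proactively or reactively), and a resolution policy fixes it (lazily or instantly). A minimal sketch of that loop — the component names and the "restart" action are illustrative, not Heron internals:

```python
# Sketch of the detection/resolution loop behind self healing: probe
# component health, then apply a resolution policy. Component names
# and the restart action are illustrative, not Heron internals.

def heal(components, is_healthy, resolve):
    """Detect unhealthy components, resolve each, return what was fixed."""
    fixed = []
    for name in components:
        if not is_healthy(name):   # detection (here: reactive, on inspection)
            resolve(name)          # resolution (here: instant, e.g., restart)
            fixed.append(name)
    return fixed

state = {"stmgr-1": "ok", "stmgr-2": "slow-host", "tmaster": "ok"}

fixed = heal(
    components=state,
    is_healthy=lambda n: state[n] == "ok",
    resolve=lambda n: state.__setitem__(n, "ok"),  # stand-in for reschedule
)
print(fixed, state)
```

In the table's terms, running this loop on a timer is reactive detection with instant resolution; a proactive system would instead predict failures (e.g., from parity-error or disk-health signals) and resolve before results are affected.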