Data Driven Testing: Case Study with Apache Helix
DESCRIPTION
Case study of how we used Helix not only to build a distributed system but also to test it. We built a chaos monkey to simulate failures and developed tools in Helix to parse ZooKeeper transaction logs along with controller and participant logs, reconstructing the exact sequence of steps that led to a failure. Once we have the exact sequence of steps, we reproduce the events using Helix for orchestration.

TRANSCRIPT
Data Driven Testing for Distributed Systems
Case study with Apache Helix
Kishore Gopalakrishna, @kishoreg1980, http://www.linkedin.com/in/kgopalak
Outline
• Intro to Helix
• Use case: distributed data store
• Traditional approach
• Data driven testing
• Q & A
What is Helix
• Generic cluster management framework
  – Partition management
  – Failure detection and handling
  – Elasticity
Terminologies
• Node: a single machine
• Cluster: set of nodes
• Resource: a logical entity, e.g., database, index, task
• Partition: subset of the resource
• Replica: copy of a partition
• State: status of a partition replica, e.g., Master, Slave
• Transition: action that lets replicas change state, e.g., Slave -> Master
Core concept: Augmented finite state machine
State Machine
• States: S1, S2, S3
• Transitions: S1->S2, S2->S1, S2->S3, S3->S1

Constraints
• States: S1: max=1, S2: min=2
• Transitions: concurrent(S1->S2) across cluster < 5

Objectives
• Partition placement
• Failure semantics
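As a sketch of the augmented-FSM idea, the following illustrative Python (class and method names are my own, not Helix's API) models a state machine whose states and transitions carry declarative constraints:

```python
# Hypothetical sketch of an augmented finite state machine (AFSM):
# a plain FSM plus declarative constraints on replica counts per state
# and on cluster-wide concurrent transitions. Illustrative names only.

class AugmentedStateMachine:
    def __init__(self, states, transitions, state_bounds, max_concurrent):
        self.states = set(states)
        self.transitions = set(transitions)     # allowed (from, to) pairs
        self.state_bounds = state_bounds        # state -> (min, max) replicas
        self.max_concurrent = max_concurrent    # (from, to) -> cluster-wide cap

    def is_legal_transition(self, src, dst):
        return (src, dst) in self.transitions

    def violations(self, replica_states, in_flight):
        """Check replica counts and in-flight transitions against constraints."""
        problems = []
        for state, (lo, hi) in self.state_bounds.items():
            n = sum(1 for s in replica_states if s == state)
            if not (lo <= n <= hi):
                problems.append(f"{state}: count {n} outside [{lo}, {hi}]")
        for move, cap in self.max_concurrent.items():
            n = in_flight.count(move)
            if n >= cap:
                problems.append(f"{move}: {n} concurrent, cap {cap}")
        return problems

# The machine from the slide: S1 max=1, S2 min=2, concurrent(S1->S2) < 5.
fsm = AugmentedStateMachine(
    states={"S1", "S2", "S3"},
    transitions={("S1", "S2"), ("S2", "S1"), ("S2", "S3"), ("S3", "S1")},
    state_bounds={"S1": (0, 1), "S2": (2, 3)},
    max_concurrent={("S1", "S2"): 5},
)
print(fsm.is_legal_transition("S1", "S2"))                    # True
print(fsm.violations(["S1", "S1", "S2"], in_flight=[]))       # two count violations
```

The point of the augmentation is that placement and failure handling fall out of solving the constraints, rather than being hand-coded per application.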
Helix usage at LinkedIn
Espresso
Use case: Distributed data store
• Timeline-consistent partitioned data store
• One master replica per partition
• Even distribution of masters/slaves
• On failure: promote slave to master
[Diagram: partitions P.1–P.12 distributed across Node 1, Node 2, and Node 3; each partition has one master replica (COUNT=1) and two slave replicas (COUNT=2).]

Placement objective: minimize( max_{nj ∈ N} S(nj) ), subject to t1 ≤ 5
[Diagram: state machine with states Offline (O), Slave (S), and Master (M), connected by transitions t1–t4.]

Objective: minimize( max_{nj ∈ N} M(nj) )
State Machine
• States: Offline, Slave, Master
• Transitions: O->S, S->M, M->S, S->O

Constraints
• States: M=1, S=2
• Transitions: concurrent(O->S) < 5

Objectives
• Partition placement
• Failure semantics
Helix-based solution
Testing
• Happy path functionality
  – Meet SLA (e.g., 99th-percentile latency)
  – Writes go to the master
• Non happy path
  – System failures
  – Application failures
  – How does the system behave in such scenarios?
Non happy path – traditional approach
• Identify scenarios of interest
  – Node failure
  – System upgrade
• Tested each scenario in isolation via test cases
  – All tests passed :)
• Deployed in alpha
  – First software upgrade failed... but we had tested it!
What was missing
• Failures don't happen in isolation
• The induction principle does not work
  – If something works once, that does not mean it will always work
• Lack of tools to debug issues
  – Could not identify the cause from one log file
• Poor coverage
  – Impossible to think of all possible test cases
What we learnt
• Test with all components integrated
• Simulate a real production environment
  – Generate load
  – Random failures of multiple components
• Better debugging tools
  – Need to correlate messages from multiple logs
  – A failure is a symptom; the actual cause is in past logs of a different machine
Data driven testing
• Instrument
  – ZooKeeper, controller, and participant logs
• Simulate
  – Chaos monkey
• Analyze
  – Invariants:
    • Respect state transition constraints
    • Respect state count constraints
    • And so on
• Debugging made easy
  – Reproduce the exact sequence of events
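One of the invariants above, "respect state transition constraints", can be checked mechanically over a time-ordered event stream. A sketch, with illustrative event tuples and the MasterSlave transition set:

```python
# Illustrative invariant check: every observed per-replica state change
# must be a legal transition in the MasterSlave model.

ALLOWED = {("OFFLINE", "SLAVE"), ("SLAVE", "MASTER"),
           ("MASTER", "SLAVE"), ("SLAVE", "OFFLINE"),
           ("OFFLINE", "DROPPED")}

def transition_violations(events):
    """events: list of (timestamp, partition, instance, state), time-ordered."""
    current = {}   # (partition, instance) -> last observed state
    bad = []
    for ts, part, inst, state in events:
        key = (part, inst)
        prev = current.get(key, "OFFLINE")   # assume replicas start OFFLINE
        if prev != state and (prev, state) not in ALLOWED:
            bad.append((ts, key, prev, state))
        current[key] = state
    return bad

events = [
    (1, "TestDB_123", "node1", "OFFLINE"),
    (2, "TestDB_123", "node1", "SLAVE"),
    (3, "TestDB_123", "node1", "MASTER"),
    (4, "TestDB_123", "node2", "MASTER"),   # OFFLINE -> MASTER: illegal jump
]
print(transition_violations(events))        # flags only the node2 event
```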
Chaos monkey
• Select random component(s) to fail
• How should it fail?
  – Hard/soft failure
  – Network partition
  – Garbage collection pause
  – Process freeze
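A minimal sketch of the chaos monkey's selection step, assuming made-up component and failure-mode names; seeding the RNG makes a failing run replayable:

```python
import random

# Hypothetical chaos-monkey selection step: pick which component(s)
# fail this round and how. Component/mode names are illustrative.

COMPONENTS = ["node1", "node2", "node3", "controller"]
FAILURE_MODES = ["hard_kill", "soft_kill", "network_partition",
                 "gc_pause", "process_freeze"]

def pick_failures(rng, max_victims=2):
    """Choose one or more victims and a failure mode for each."""
    victims = rng.sample(COMPONENTS, k=rng.randint(1, max_victims))
    return [(v, rng.choice(FAILURE_MODES)) for v in victims]

rng = random.Random(42)   # fixed seed so a failing run can be replayed exactly
plan = pick_failures(rng)
print(plan)
```

Recording the seed alongside the test run is what lets "select a random component to fail" stay deterministic when a failure needs to be reproduced.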
Automation of chaos monkey
• Helix agent on each node
• Modify the behavior of each service using Helix
  – Component 1
    • Node1: RUNNING
    • Node2: STOPPED
    • Node3: KILLED
  – Component 2
    • Node1: STOPPED
[Diagram: agent state machine with states RUNNING, STOPPED, KILLED, FREEZED and transitions START, STOP, PAUSE, UNPAUSE, KILL.]
Pseudo test case

    setup cluster
    generate load
    do
        (c, t) = components to fail and type of failure
        simulate failure
        verify system_is_stable
        restart failed components
    while (verify system_is_stable)
    -- on exit: test case failed & here is the sequence of events
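The pseudo test case above could be sketched in runnable form like this; the cluster interactions are stand-in stubs, not real Helix or chaos-monkey calls:

```python
import random

# Runnable sketch of the pseudo test case. make_harness builds stub
# callbacks; the fake cluster "breaks" on a chosen round so the loop's
# stop-and-report behavior can be exercised. Everything is illustrative.

def make_harness(fail_on_round):
    state = {"round": 0}

    def simulate_failure(component, failure):
        state["round"] += 1            # stand-in for injecting a real fault

    def system_is_stable():
        return state["round"] != fail_on_round

    def restart(component):
        pass                           # stand-in for restarting the component

    return simulate_failure, system_is_stable, restart

def run_test(rounds, rng, simulate_failure, system_is_stable, restart):
    history = []                       # exact sequence of injected events
    for _ in range(rounds):
        component = rng.choice(["node1", "node2", "node3"])
        failure = rng.choice(["kill", "freeze", "partition"])
        history.append((component, failure))
        simulate_failure(component, failure)
        if not system_is_stable():
            # Stop right away: the cluster stays up for debugging, and
            # `history` is the auto-generated reproduction test case.
            return False, history
        restart(component)
    return True, history

ok, history = run_test(10, random.Random(0), *make_harness(fail_on_round=3))
print(ok, len(history))   # False 3 -- failed on the 3rd injected failure
```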
Cluster verification
• Verify all constraints are satisfied
  – Is there a master for every partition?
  – Is the slave replicating?
  – A node/component being down should not matter
  – Validate every action, not just the end result
    • Having a master at the end is not good enough if two nodes became master and later one of them died.
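"Validate every action, not just the end result" can be sketched as replaying the time-ordered events and checking the master-count constraint after each one (event tuples are illustrative):

```python
# Illustrative per-action check: after every event, no partition may have
# more than one MASTER replica. A final-state-only check would miss a
# transient double-master window.

def master_count_violations(events):
    """events: list of (timestamp, partition, instance, state), time-ordered."""
    state = {}        # (partition, instance) -> current state
    bad = []
    for ts, part, inst, st in events:
        state[(part, inst)] = st
        masters = [i for (p, i), s in state.items()
                   if p == part and s == "MASTER"]
        if len(masters) > 1:
            bad.append((ts, part, sorted(masters)))
    return bad

# Two nodes briefly master the same partition; the end state looks fine,
# so only an action-by-action check catches the violation.
events = [
    (1, "P_0", "node1", "MASTER"),
    (2, "P_0", "node2", "MASTER"),    # double master
    (3, "P_0", "node1", "OFFLINE"),   # end state is legal again
]
print(master_count_violations(events))   # one violation, at timestamp 2
```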
Log analysis
• Log important events
  – e.g., "became master from slave for this partition at this time"
• Tools to collect, merge & analyze logs
  – Parsed ZooKeeper transaction logs
  – Gathered Helix controller and participant logs
  – Sorted by time
• Helix provides these tools out of the box
Structured log file – sample

timestamp  partition  instanceName  sessionId  state
1323312236368 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236426 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236530 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236530 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236561 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
1323312236561 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236685 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
1323312236685 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236685 TestDB_60 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236719 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
1323312236719 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
1323312236719 TestDB_60 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236814 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
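The structured log above can be parsed, merged, and time-sorted with a few lines. A sketch assuming the whitespace-delimited layout shown (session IDs shortened for readability):

```python
# Parse whitespace-delimited structured log records, then sort by
# timestamp -- the same merge-and-sort the log-analysis tools perform
# across ZooKeeper, controller, and participant logs.

def parse_line(line):
    ts, partition, instance, session, state = line.split()
    return {"timestamp": int(ts), "partition": partition,
            "instance": instance, "session": session, "state": state}

# Lines from two sources, deliberately out of order before the merge.
log = """\
1323312236426 TestDB_123 express1-md_16918 ef172fe9 OFFLINE
1323312236368 TestDB_123 express1-md_16918 ef172fe9 OFFLINE
1323312236561 TestDB_123 express1-md_16918 ef172fe9 SLAVE
"""

records = sorted((parse_line(l) for l in log.splitlines()),
                 key=lambda r: r["timestamp"])
print([r["state"] for r in records])   # ['OFFLINE', 'OFFLINE', 'SLAVE']
```

Once every source is normalized to this record shape, a single global sort on the timestamp reconstructs the cluster-wide sequence of events.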
Benefits
• The test case stops as soon as the system is unstable
  – The cluster is available for debugging
• Provides the exact sequence of events
  – Makes it easy to debug and reproduce
  – Best part: we auto-generated the test case
Reproduce the issue
• Start state
  – Helix brings the system to the start state, described as a CUSTOM ideal state:

    {
      "id" : "MyDataStore",
      "simpleFields" : {
        "IDEAL_STATE_MODE" : "CUSTOM",
        "NUM_PARTITIONS" : "2",
        "REPLICAS" : "3",
        "STATE_MODEL_DEF_REF" : "MasterSlave"
      },
      "mapFields" : {
        "MyDataStore_0" : {
          "node1" : "MASTER",
          "node2" : "OFFLINE",
          "node3" : "SLAVE"
        },
        "MyDataStore_1" : {
          "node1" : "SLAVE",
          "node2" : "OFFLINE",
          "node3" : "MASTER"
        }
      }
    }

• Orchestrate the sequence
  – Use the Helix messaging API to replay the events:
    1. Node1: MyDataStore_0: Master -> Slave
    2. Node1: HARD KILL
    3. Node2: START
Constraint violation

Time   State    Number of Slaves   Instance
42632 OFFLINE 0 10.117.58.247_12918
42796 SLAVE 1 10.117.58.247_12918
43124 OFFLINE 1 10.202.187.155_12918
43131 OFFLINE 1 10.220.225.153_12918
43275 SLAVE 2 10.220.225.153_12918
43323 SLAVE 3 10.202.187.155_12918
85795 MASTER 2 10.220.225.153_12918
Constraint: no more than R=2 slaves per partition (violated: 3 slaves at t=43323)
How long was it out of whack?

Number of Slaves   Time        Percentage
0                  1082319     0.5
1                  35578388    16.46
2                  179417802   82.99
3                  118863      0.05

83% of the time, there were 2 slaves per partition; 93% of the time, there was 1 master per partition.

Number of Masters   Time        Percentage
0                   15490456    7.16
1                   200706916   92.84
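The time-in-state percentages above can be derived from the event stream by weighting each observed master count by how long it held. An illustrative sketch with made-up numbers:

```python
# Compute a time-weighted distribution over master counts from a
# time-ordered series of (timestamp, master_count) observations -- the
# kind of analysis behind the percentage tables above. Numbers are made up.

def time_in_state(samples):
    """samples: [(timestamp, master_count), ...], time-ordered."""
    totals = {}
    # Each observation holds until the next one; weight by the interval.
    for (t0, count), (t1, _) in zip(samples, samples[1:]):
        totals[count] = totals.get(count, 0) + (t1 - t0)
    span = samples[-1][0] - samples[0][0]
    return {c: 100.0 * d / span for c, d in totals.items()}

# Illustrative timeline: the master is lost at t=100 and restored at t=107.
samples = [(0, 1), (100, 0), (107, 1), (200, 1)]
print(time_in_state(samples))   # {1: 96.5, 0: 3.5}
```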
Invariant 2: State Transitions

FROM     TO       COUNT
MASTER SLAVE 55
OFFLINE DROPPED 0
OFFLINE SLAVE 298
SLAVE MASTER 155
SLAVE OFFLINE 0
Fun facts
• For almost a month, the test failed to run successfully through a single night
• Most issues were found using one test case
• Reproduced almost all failures
Conclusion
• The traditional approach is not good enough
• Data driven testing is the way to go
  – Focus on workload and analysis
  – Production system is always in test mode
  – Leverage tools built for testing to debug production issues