Data Driven Testing: Case Study with Apache Helix
DESCRIPTION
Case study of how we used Helix not only to build a distributed system but also to test it. We built a chaos monkey to simulate failures and developed tools in Helix to parse ZooKeeper transaction logs along with controller and participant logs, reconstructing the exact sequence of steps that led to a failure. Once we have the exact sequence of steps, we reproduce the events using Helix for orchestration.

TRANSCRIPT
Data Driven Testing for Distributed Systems
Case study with Apache Helix
Kishore Gopalakrishna, @kishoreg1980, http://www.linkedin.com/in/kgopalak
Outline
• Intro to Helix
• Use case: distributed data store
• Traditional approach
• Data driven testing
• Q & A
What is Helix
• Generic cluster management framework
  – Partition management
  – Failure detection and handling
  – Elasticity
Terminologies
• Node: a single machine
• Cluster: set of nodes
• Resource: a logical entity, e.g., database, index, task
• Partition: subset of the resource
• Replica: copy of a partition
• State: status of a partition replica, e.g., Master, Slave
• Transition: action that lets replicas change state, e.g., Slave -> Master
Core concept: Augmented finite state machine
State Machine
• States: S1, S2, S3
• Transitions: S1->S2, S2->S1, S2->S3, S3->S1

Constraints
• States: S1: max=1, S2: min=2
• Transitions: concurrent(S1->S2) across cluster < 5

Objectives
• Partition placement
• Failure semantics
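As a sketch of the augmented-FSM idea, the following illustrative Python (class and method names are my own, not Helix's API) models a state machine whose states and transitions carry declarative constraints:

```python
# Hypothetical sketch of an augmented finite state machine (AFSM):
# a plain FSM plus declarative constraints on replica counts per state
# and on cluster-wide concurrent transitions. Illustrative names only.

class AugmentedStateMachine:
    def __init__(self, states, transitions, state_bounds, max_concurrent):
        self.states = set(states)
        self.transitions = set(transitions)     # allowed (from, to) pairs
        self.state_bounds = state_bounds        # state -> (min, max) replicas
        self.max_concurrent = max_concurrent    # (from, to) -> cluster-wide cap

    def is_legal_transition(self, src, dst):
        return (src, dst) in self.transitions

    def violations(self, replica_states, in_flight):
        """Check replica counts and in-flight transitions against constraints."""
        problems = []
        for state, (lo, hi) in self.state_bounds.items():
            n = sum(1 for s in replica_states if s == state)
            if not (lo <= n <= hi):
                problems.append(f"{state}: count {n} outside [{lo}, {hi}]")
        for move, cap in self.max_concurrent.items():
            n = in_flight.count(move)
            if n >= cap:
                problems.append(f"{move}: {n} concurrent, cap {cap}")
        return problems

# The machine from the slide: S1 max=1, S2 min=2, concurrent(S1->S2) < 5.
fsm = AugmentedStateMachine(
    states={"S1", "S2", "S3"},
    transitions={("S1", "S2"), ("S2", "S1"), ("S2", "S3"), ("S3", "S1")},
    state_bounds={"S1": (0, 1), "S2": (2, 3)},
    max_concurrent={("S1", "S2"): 5},
)
print(fsm.is_legal_transition("S1", "S2"))                    # True
print(fsm.violations(["S1", "S1", "S2"], in_flight=[]))       # two count violations
```

The point of the augmentation is that placement and failure handling fall out of solving the constraints, rather than being hand-coded per application.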
Helix usage at LinkedIn
Espresso
Use case: Distributed data store
• Timeline-consistent partitioned data store
• One master replica per partition
• Even distribution of masters/slaves
• On failure: promote slave to master
[Diagram: partitions P.1–P.12 distributed across Node 1, Node 2, and Node 3; each partition has one master replica (COUNT=1) and two slave replicas (COUNT=2).]

Placement objective: minimize( max_{nj ∈ N} S(nj) ), subject to t1 ≤ 5
[Diagram: state machine with states Offline (O), Slave (S), and Master (M), connected by transitions t1–t4.]

Objective: minimize( max_{nj ∈ N} M(nj) )
State Machine
• States: Offline, Slave, Master
• Transitions: O->S, S->M, M->S, S->O

Constraints
• States: M=1, S=2
• Transitions: concurrent(O->S) < 5

Objectives
• Partition placement
• Failure semantics
Helix-based solution
Testing
• Happy path functionality
  – Meet SLA (e.g., 99th-percentile latency)
  – Writes go to the master
• Non happy path
  – System failures
  – Application failures
  – How does the system behave in such scenarios?
Non happy path – traditional approach
• Identify scenarios of interest
  – Node failure
  – System upgrade
• Tested each scenario in isolation via test cases
  – All tests passed :)
• Deployed in alpha
  – First software upgrade failed... but we had tested it!
What was missing
• Failures don't happen in isolation
• The induction principle does not work
  – If something works once, that does not mean it will always work
• Lack of tools to debug issues
  – Could not identify the cause from one log file
• Poor coverage
  – Impossible to think of all possible test cases
What we learnt
• Test with all components integrated
• Simulate a real production environment
  – Generate load
  – Random failures of multiple components
• Better debugging tools
  – Need to correlate messages from multiple logs
  – A failure is a symptom; the actual cause is in past logs of a different machine
Data driven testing
• Instrument
  – ZooKeeper, controller, and participant logs
• Simulate
  – Chaos monkey
• Analyze
  – Invariants:
    • Respect state transition constraints
    • Respect state count constraints
    • And so on
• Debugging made easy
  – Reproduce the exact sequence of events
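One of the invariants above, "respect state transition constraints", can be checked mechanically over a time-ordered event stream. A sketch, with illustrative event tuples and the MasterSlave transition set:

```python
# Illustrative invariant check: every observed per-replica state change
# must be a legal transition in the MasterSlave model.

ALLOWED = {("OFFLINE", "SLAVE"), ("SLAVE", "MASTER"),
           ("MASTER", "SLAVE"), ("SLAVE", "OFFLINE"),
           ("OFFLINE", "DROPPED")}

def transition_violations(events):
    """events: list of (timestamp, partition, instance, state), time-ordered."""
    current = {}   # (partition, instance) -> last observed state
    bad = []
    for ts, part, inst, state in events:
        key = (part, inst)
        prev = current.get(key, "OFFLINE")   # assume replicas start OFFLINE
        if prev != state and (prev, state) not in ALLOWED:
            bad.append((ts, key, prev, state))
        current[key] = state
    return bad

events = [
    (1, "TestDB_123", "node1", "OFFLINE"),
    (2, "TestDB_123", "node1", "SLAVE"),
    (3, "TestDB_123", "node1", "MASTER"),
    (4, "TestDB_123", "node2", "MASTER"),   # OFFLINE -> MASTER: illegal jump
]
print(transition_violations(events))        # flags only the node2 event
```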
Chaos monkey
• Select random component(s) to fail
• How should it fail?
  – Hard/soft failure
  – Network partition
  – Garbage collection pause
  – Process freeze
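A minimal sketch of the chaos monkey's selection step, assuming made-up component and failure-mode names; seeding the RNG makes a failing run replayable:

```python
import random

# Hypothetical chaos-monkey selection step: pick which component(s)
# fail this round and how. Component/mode names are illustrative.

COMPONENTS = ["node1", "node2", "node3", "controller"]
FAILURE_MODES = ["hard_kill", "soft_kill", "network_partition",
                 "gc_pause", "process_freeze"]

def pick_failures(rng, max_victims=2):
    """Choose one or more victims and a failure mode for each."""
    victims = rng.sample(COMPONENTS, k=rng.randint(1, max_victims))
    return [(v, rng.choice(FAILURE_MODES)) for v in victims]

rng = random.Random(42)   # fixed seed so a failing run can be replayed exactly
plan = pick_failures(rng)
print(plan)
```

Recording the seed alongside the test run is what lets "select a random component to fail" stay deterministic when a failure needs to be reproduced.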
Automation of chaos monkey
• Helix agent on each node
• Modify the behavior of each service using Helix
  – Component 1
    • Node1: RUNNING
    • Node2: STOPPED
    • Node3: KILLED
  – Component 2
    • Node1: STOPPED
[Diagram: agent state machine with states RUNNING, STOPPED, KILLED, FREEZED and transitions START, STOP, PAUSE, UNPAUSE, KILL.]
Pseudo test case

    setup cluster
    generate load
    do
        (c, t) = components to fail and type of failure
        simulate failure
        verify system_is_stable
        restart failed components
    while (verify system_is_stable)
    -- on exit: test case failed & here is the sequence of events
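The pseudo test case above could be sketched in runnable form like this; the cluster interactions are stand-in stubs, not real Helix or chaos-monkey calls:

```python
import random

# Runnable sketch of the pseudo test case. make_harness builds stub
# callbacks; the fake cluster "breaks" on a chosen round so the loop's
# stop-and-report behavior can be exercised. Everything is illustrative.

def make_harness(fail_on_round):
    state = {"round": 0}

    def simulate_failure(component, failure):
        state["round"] += 1            # stand-in for injecting a real fault

    def system_is_stable():
        return state["round"] != fail_on_round

    def restart(component):
        pass                           # stand-in for restarting the component

    return simulate_failure, system_is_stable, restart

def run_test(rounds, rng, simulate_failure, system_is_stable, restart):
    history = []                       # exact sequence of injected events
    for _ in range(rounds):
        component = rng.choice(["node1", "node2", "node3"])
        failure = rng.choice(["kill", "freeze", "partition"])
        history.append((component, failure))
        simulate_failure(component, failure)
        if not system_is_stable():
            # Stop right away: the cluster stays up for debugging, and
            # `history` is the auto-generated reproduction test case.
            return False, history
        restart(component)
    return True, history

ok, history = run_test(10, random.Random(0), *make_harness(fail_on_round=3))
print(ok, len(history))   # False 3 -- failed on the 3rd injected failure
```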
Cluster verification
• Verify all constraints are satisfied
  – Is there a master for every partition?
  – Is the slave replicating?
  – A node/component being down should not matter
  – Validate every action, not just the end result
    • Having a master at the end is not good enough if two nodes became master and later one of them died.
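"Validate every action, not just the end result" can be sketched as replaying the time-ordered events and checking the master-count constraint after each one (event tuples are illustrative):

```python
# Illustrative per-action check: after every event, no partition may have
# more than one MASTER replica. A final-state-only check would miss a
# transient double-master window.

def master_count_violations(events):
    """events: list of (timestamp, partition, instance, state), time-ordered."""
    state = {}        # (partition, instance) -> current state
    bad = []
    for ts, part, inst, st in events:
        state[(part, inst)] = st
        masters = [i for (p, i), s in state.items()
                   if p == part and s == "MASTER"]
        if len(masters) > 1:
            bad.append((ts, part, sorted(masters)))
    return bad

# Two nodes briefly master the same partition; the end state looks fine,
# so only an action-by-action check catches the violation.
events = [
    (1, "P_0", "node1", "MASTER"),
    (2, "P_0", "node2", "MASTER"),    # double master
    (3, "P_0", "node1", "OFFLINE"),   # end state is legal again
]
print(master_count_violations(events))   # one violation, at timestamp 2
```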
Log analysis
• Log important events
  – e.g., "became master from slave for this partition at this time"
• Tools to collect, merge & analyze logs
  – Parsed ZooKeeper transaction logs
  – Gathered Helix controller and participant logs
  – Sorted by time
• Helix provides these tools out of the box
Structured log file – sample

timestamp  partition  instanceName  sessionId  state
1323312236368 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236426 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236530 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236530 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236561 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
1323312236561 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236685 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
1323312236685 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236685 TestDB_60 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236719 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
1323312236719 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
1323312236719 TestDB_60 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE
1323312236814 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
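The structured log above can be parsed, merged, and time-sorted with a few lines. A sketch assuming the whitespace-delimited layout shown (session IDs shortened for readability):

```python
# Parse whitespace-delimited structured log records, then sort by
# timestamp -- the same merge-and-sort the log-analysis tools perform
# across ZooKeeper, controller, and participant logs.

def parse_line(line):
    ts, partition, instance, session, state = line.split()
    return {"timestamp": int(ts), "partition": partition,
            "instance": instance, "session": session, "state": state}

# Lines from two sources, deliberately out of order before the merge.
log = """\
1323312236426 TestDB_123 express1-md_16918 ef172fe9 OFFLINE
1323312236368 TestDB_123 express1-md_16918 ef172fe9 OFFLINE
1323312236561 TestDB_123 express1-md_16918 ef172fe9 SLAVE
"""

records = sorted((parse_line(l) for l in log.splitlines()),
                 key=lambda r: r["timestamp"])
print([r["state"] for r in records])   # ['OFFLINE', 'OFFLINE', 'SLAVE']
```

Once every source is normalized to this record shape, a single global sort on the timestamp reconstructs the cluster-wide sequence of events.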
Benefits
• The test case stops as soon as the system is unstable
  – The cluster is available for debugging
• Provides the exact sequence of events
  – Makes it easy to debug and reproduce
  – Best part: we auto-generated the test case
Reproduce the issue
• Start state
  – Helix brings the system to the start state, described as a CUSTOM ideal state:

    {
      "id" : "MyDataStore",
      "simpleFields" : {
        "IDEAL_STATE_MODE" : "CUSTOM",
        "NUM_PARTITIONS" : "2",
        "REPLICAS" : "3",
        "STATE_MODEL_DEF_REF" : "MasterSlave"
      },
      "mapFields" : {
        "MyDataStore_0" : {
          "node1" : "MASTER",
          "node2" : "OFFLINE",
          "node3" : "SLAVE"
        },
        "MyDataStore_1" : {
          "node1" : "SLAVE",
          "node2" : "OFFLINE",
          "node3" : "MASTER"
        }
      }
    }

• Orchestrate the sequence
  – Use the Helix messaging API to replay the events:
    1. Node1: MyDataStore_0: Master -> Slave
    2. Node1: HARD KILL
    3. Node2: START
Constraint violation

Time   State    Number of Slaves   Instance
42632 OFFLINE 0 10.117.58.247_12918
42796 SLAVE 1 10.117.58.247_12918
43124 OFFLINE 1 10.202.187.155_12918
43131 OFFLINE 1 10.220.225.153_12918
43275 SLAVE 2 10.220.225.153_12918
43323 SLAVE 3 10.202.187.155_12918
85795 MASTER 2 10.220.225.153_12918
Constraint: no more than R=2 slaves per partition (violated: 3 slaves at t=43323)
How long was it out of whack?

Number of Slaves   Time        Percentage
0                  1082319     0.5
1                  35578388    16.46
2                  179417802   82.99
3                  118863      0.05

83% of the time, there were 2 slaves per partition; 93% of the time, there was 1 master per partition.

Number of Masters   Time        Percentage
0                   15490456    7.16
1                   200706916   92.84
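The time-in-state percentages above can be derived from the event stream by weighting each observed master count by how long it held. An illustrative sketch with made-up numbers:

```python
# Compute a time-weighted distribution over master counts from a
# time-ordered series of (timestamp, master_count) observations -- the
# kind of analysis behind the percentage tables above. Numbers are made up.

def time_in_state(samples):
    """samples: [(timestamp, master_count), ...], time-ordered."""
    totals = {}
    # Each observation holds until the next one; weight by the interval.
    for (t0, count), (t1, _) in zip(samples, samples[1:]):
        totals[count] = totals.get(count, 0) + (t1 - t0)
    span = samples[-1][0] - samples[0][0]
    return {c: 100.0 * d / span for c, d in totals.items()}

# Illustrative timeline: the master is lost at t=100 and restored at t=107.
samples = [(0, 1), (100, 0), (107, 1), (200, 1)]
print(time_in_state(samples))   # {1: 96.5, 0: 3.5}
```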
Invariant 2: State Transitions

FROM     TO       COUNT
MASTER SLAVE 55
OFFLINE DROPPED 0
OFFLINE SLAVE 298
SLAVE MASTER 155
SLAVE OFFLINE 0
Fun facts
• For almost a month, the test failed to run successfully through a single night
• Most issues were found using one test case
• Reproduced almost all failures
Conclusion
• The traditional approach is not good enough
• Data driven testing is the way to go
  – Focus on workload and analysis
  – Production system is always in test mode
  – Leverage tools built for testing to debug production issues