Data driven testing: Case study with Apache Helix

Data Driven Testing for Distributed Systems: Case study with Apache Helix. Kishore Gopalakrishna, @kishoreg1980, http://www.linkedin.com/in/kgopalak

Upload: kishore-gopalakrishna

Post on 06-May-2015


DESCRIPTION

Case study of how we used Helix not only to build a distributed system but also to test it. We built a Chaos Monkey to simulate failures and developed tools in Helix that parse ZooKeeper transaction logs together with controller and participant logs to reconstruct the exact sequence of steps that led to a failure. Once we have that sequence, we reproduce the events using Helix for orchestration.

TRANSCRIPT

Page 1: Data driven testing: Case study with Apache Helix

Data Driven Testing for Distributed Systems

Case study with Apache Helix

Kishore Gopalakrishna, @kishoreg1980, http://www.linkedin.com/in/kgopalak

Page 2: Data driven testing: Case study with Apache Helix

Outline

• Intro to Helix
• Use case: Distributed data store
• Traditional approach
• Data driven testing
• Q & A

Page 3: Data driven testing: Case study with Apache Helix

What is Helix

• Generic cluster management framework
  – Partition management
  – Failure detection and handling
  – Elasticity

Page 4: Data driven testing: Case study with Apache Helix

Terminologies

Node: A single machine
Cluster: Set of nodes
Resource: A logical entity, e.g. database, index, task
Partition: Subset of the resource
Replica: Copy of a partition
State: Status of a partition replica, e.g. Master, Slave
Transition: Action that lets replicas change status, e.g. Slave -> Master

Page 5: Data driven testing: Case study with Apache Helix

Core concept: Augmented finite state machine

State Machine
• States: S1, S2, S3
• Transitions: S1 -> S2, S2 -> S1, S2 -> S3, S3 -> S1

Constraints
• States: S1: max=1, S2: min=2
• Transitions: Concurrent(S1 -> S2) across cluster < 5

Objectives
• Partition placement
• Failure semantics

Page 6: Data driven testing: Case study with Apache Helix

Helix usage at LinkedIn

[Figure: systems at LinkedIn that use Helix; the only label that survives extraction is Espresso.]

Page 7: Data driven testing: Case study with Apache Helix

Use case: Distributed data store

• Timeline-consistent partitioned data store
• One master replica per partition
• Even distribution of masters and slaves
• On failure: promote a slave to master

[Figure: partitions P.1 to P.12 and their replicas spread across Node 1, Node 2 and Node 3.]

Page 8: Data driven testing: Case study with Apache Helix

Helix based solution

[State diagram: O (Offline), S (Slave) and M (Master), linked by transitions t1 to t4. Annotations: COUNT=2, COUNT=1, t1 ≤ 5. Placement objectives: $\text{minimize}\ \max_{n_j \in N} S(n_j)$ and $\text{minimize}\ \max_{n_j \in N} M(n_j)$.]

State Machine
• States: Offline, Slave, Master
• Transitions: O -> S, S -> O, S -> M, M -> S

Constraints
• States: M=1, S=2
• Transitions: concurrent(O -> S) < 5

Objectives
• Partition placement
• Failure semantics
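The state model on this slide can be written down as plain data. What follows is a minimal Python sketch, not Helix's own API: the names `TRANSITIONS`, `STATE_LIMITS`, `is_legal` and `violations` are hypothetical, and the Slave -> Offline transition is an assumption.

```python
from collections import Counter

# Legal transitions of the Offline/Slave/Master model.
# SLAVE -> OFFLINE is an assumption made for this sketch.
TRANSITIONS = {
    ("OFFLINE", "SLAVE"),
    ("SLAVE", "MASTER"),
    ("MASTER", "SLAVE"),
    ("SLAVE", "OFFLINE"),
}

# Per-partition upper bounds taken from the slide: M=1, S=2.
STATE_LIMITS = {"MASTER": 1, "SLAVE": 2}


def is_legal(current, target):
    """Return True if a replica may move directly from current to target."""
    return (current, target) in TRANSITIONS


def violations(replica_states):
    """Return {state: count} for every state that exceeds its limit.

    replica_states maps node name -> state for a single partition.
    """
    counts = Counter(replica_states.values())
    return {
        state: n
        for state, n in counts.items()
        if n > STATE_LIMITS.get(state, float("inf"))
    }
```

For example, `violations({"node1": "MASTER", "node2": "MASTER", "node3": "SLAVE"})` reports the MASTER state as over its limit, while a direct Offline -> Master move is rejected by `is_legal`.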

Page 9: Data driven testing: Case study with Apache Helix

Testing

• Happy path functionality
  – Meet SLA
    • 99th percentile latency etc.
  – Writes to master
• Non happy path
  – System failures
  – Application failures
  – How does the system behave in such scenarios?

 

Page 10: Data driven testing: Case study with Apache Helix

Non happy path - Traditional approach

• Identify scenarios of interest
  – Node failure
  – System upgrade
• Tested each scenario in isolation via a test case
  – All tests passed :)
• Deployed in alpha
  – First software upgrade failed … but we tested it

Page 11: Data driven testing: Case study with Apache Helix

What was missing

• Failures don't happen in isolation
• Induction principle does not work
  – Something working once does not mean it will always work
• Lack of tools to debug issues
  – Could not identify the cause from one log file
• Poor coverage
  – Impossible to think of all possible test cases

Page 12: Data driven testing: Case study with Apache Helix

What we learnt

• Test with all components integrated
• Simulate the real production environment
  – Generate load
  – Random failures of multiple components
• Better debugging tools
  – Need to correlate messages from multiple logs
  – A failure is a symptom; the actual reason is in past logs of a different machine

Page 13: Data driven testing: Case study with Apache Helix

Data driven testing

• Instrument
  – ZooKeeper, controller, participant logs
• Simulate
  – Chaos monkey
• Analyze
  – Invariants are
    • Respect state transition constraints
    • Respect state count constraints
    • And so on
• Debugging made easy
  – Reproduce exact sequence of events

Page 14: Data driven testing: Case study with Apache Helix

Chaos monkey

• Select random component(s) to fail
• How should it fail
  – Hard/soft failure
  – Network partition
  – Garbage collection
  – Process freeze

 

Page 15: Data driven testing: Case study with Apache Helix

Automation of chaos monkey

• Helix agent on each node
• Modify the behavior of each service using Helix
  – Component 1
    • Node1: RUNNING
    • Node2: STOPPED
    • Node3: KILLED
  – Component 2
    • Node1: STOPPED

[State machine diagram: states STOPPED, RUNNING, KILLED, FREEZED; transitions START, PAUSE, STOP, KILL, UNPAUSE.]
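Driving each service to a desired state amounts to finding a command sequence through the component state machine. The sketch below is a hedged illustration in Python: the slide names the states and commands, but which command connects which pair of states is an assumption here, and `COMMANDS`, `apply_command` and `plan` are hypothetical names.

```python
# Assumed edges: the slide lists the states and commands, but the arrows
# of the diagram are not available, so this mapping is a guess.
COMMANDS = {
    ("STOPPED", "START"): "RUNNING",
    ("KILLED", "START"): "RUNNING",
    ("RUNNING", "STOP"): "STOPPED",
    ("RUNNING", "KILL"): "KILLED",
    ("RUNNING", "PAUSE"): "FREEZED",
    ("FREEZED", "UNPAUSE"): "RUNNING",
}


def apply_command(state, command):
    """Return the next state, or raise ValueError if the command is not allowed."""
    try:
        return COMMANDS[(state, command)]
    except KeyError:
        raise ValueError(f"cannot {command} a component that is {state}") from None


def plan(current, target):
    """Breadth-first search for the shortest command sequence from current to target."""
    frontier = [(current, [])]
    seen = {current}
    while frontier:
        state, path = frontier.pop(0)
        if state == target:
            return path
        for (src, cmd), dst in COMMANDS.items():
            if src == state and dst not in seen:
                seen.add(dst)
                frontier.append((dst, path + [cmd]))
    return None  # target is unreachable from current
```

Under these assumed edges, moving a STOPPED component to KILLED takes two commands: START, then KILL.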

Page 16: Data driven testing: Case study with Apache Helix

Pseudo test case

setup cluster
generate load
do
    (c, t) = components to fail and type of failure
    simulate failure
    verify system_is_stable
    restart failed components
while (verify system_is_stable)
report: test case failed, and here is the sequence of events
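The loop above can be sketched as runnable Python. This is an illustration under stated assumptions, not the actual test harness: `run_chaos_test` and its callback parameters (`inject`, `is_stable`, `restart`) are hypothetical, and the `max_rounds` cap and the seeded random generator are additions so that the sketch terminates and can be repeated.

```python
import random


def run_chaos_test(components, failure_types, inject, is_stable, restart,
                   max_rounds=100, seed=0):
    """Inject random failures until the system is unstable or max_rounds is hit.

    Returns (passed, events), where events is the ordered list of
    (component, failure_type) pairs that were injected.
    """
    rng = random.Random(seed)  # seeded so a failing run can be repeated
    events = []
    for _ in range(max_rounds):
        component = rng.choice(components)
        failure = rng.choice(failure_types)
        events.append((component, failure))
        inject(component, failure)
        if not is_stable():
            # Stop immediately and leave the cluster untouched for debugging.
            return False, events
        restart(component)
    return True, events
```

When the run fails, `events` is the sequence that led to the unstable state, which is what a replay needs as input.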

   

Page 17: Data driven testing: Case study with Apache Helix

Cluster verification

• Verify all constraints are satisfied
  – Is there a master for every partition?
  – Is each slave replicating?
  – A node or component being down should not matter
  – Validate every action, not just the end result
    • Having a master is not good enough if two nodes became master and one of them later died
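Validating every action rather than the end result means replaying the events and checking the invariant after each one. A minimal Python sketch, assuming events arrive as `(time, partition, instance, state)` tuples sorted by time; the function name and tuple layout are hypothetical.

```python
def first_master_violation(events):
    """Return the first event after which a partition has more than one master.

    events is an iterable of (time, partition, instance, state) tuples,
    already sorted by time. Returns None if the invariant always holds.
    """
    current = {}  # (partition, instance) -> latest known state
    for event in events:
        _time, partition, instance, state = event
        current[(partition, instance)] = state
        masters = [
            inst
            for (part, inst), st in current.items()
            if part == partition and st == "MASTER"
        ]
        if len(masters) > 1:
            return event
    return None
```

In a sequence where node2 becomes master while node1 still is, and node1 then goes offline, the final state has exactly one master, yet the function still returns the event at which the second master appeared.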

Page 18: Data driven testing: Case study with Apache Helix

Log analysis

• Log important events
  – Becoming master from slave for this partition at this time
• Tools to collect, merge & analyze logs
  – Parsed ZooKeeper transaction logs
  – Gathered Helix controller and participant logs
  – Sorted on time
• Helix provides these tools out of the box
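The merge step can be illustrated with a short Python sketch. It is not the Helix tooling itself: `merge_logs` and the `(timestamp_ms, source, message)` tuple layout are hypothetical, and the sketch assumes each log is already ordered and that clocks on the machines are close enough for timestamps to be comparable.

```python
import heapq


def merge_logs(*logs):
    """Merge per-source event lists into one list ordered by timestamp.

    Each log is a list of (timestamp_ms, source, message) tuples that is
    already sorted by timestamp, as a single process's log normally is.
    """
    return list(heapq.merge(*logs, key=lambda event: event[0]))
```

Because every input is already sorted, `heapq.merge` only needs to compare the head of each log, so the merged timeline is produced in a single pass.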

Page 19: Data driven testing: Case study with Apache Helix

Structured Log File – sample

timestamp partition instanceName sessionId state

1323312236368 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE

1323312236426 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE

1323312236530 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE

1323312236530 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE

1323312236561 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE

1323312236561 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE

1323312236685 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE

1323312236685 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE

1323312236685 TestDB_60 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE

1323312236719 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE

1323312236719 TestDB_91 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE

1323312236719 TestDB_60 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc OFFLINE

1323312236814 TestDB_123 express1-md_16918 ef172fe9-09ca-4d77b05e-15a414478ccc SLAVE
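A log in this shape is straightforward to load. The following Python sketch parses the five whitespace-separated columns and keeps only the rows where a replica's state actually changes; `Record`, `parse_line` and `state_changes` are hypothetical names, not part of Helix.

```python
from collections import namedtuple

Record = namedtuple("Record", "timestamp partition instance session state")


def parse_line(line):
    """Parse one whitespace-separated line of the structured log."""
    timestamp, partition, instance, session, state = line.split()
    return Record(int(timestamp), partition, instance, session, state)


def state_changes(records):
    """Keep only records where a replica's state differs from its previous one.

    The first record seen for each (partition, instance) is kept as its
    initial state.
    """
    last = {}  # (partition, instance) -> last seen state
    changes = []
    for rec in records:
        key = (rec.partition, rec.instance)
        if last.get(key) != rec.state:
            changes.append(rec)
            last[key] = rec.state
    return changes
```

Applied to the TestDB_123 rows of the sample, this collapses the repeated OFFLINE and SLAVE snapshots into one OFFLINE record followed by one SLAVE record at 1323312236561.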

Page 20: Data driven testing: Case study with Apache Helix

Benefits

• Test case stops as soon as the system is unstable
  – The cluster is available for debugging
• Provides exact sequence of events
  – Makes it easy to debug and reproduce
  – Best part: the test case is auto-generated

Page 21: Data driven testing: Case study with Apache Helix

Reproduce the issue

Start state
• Helix brings the system to the start state.

    {
      "id" : "MyDataStore",
      "simpleFields" : {
        "IDEAL_STATE_MODE" : "CUSTOM",
        "NUM_PARTITIONS" : "2",
        "REPLICAS" : "3",
        "STATE_MODEL_DEF_REF" : "MasterSlave"
      },
      "mapFields" : {
        "MyDataStore_0" : {
          "node1" : "MASTER",
          "node2" : "OFFLINE",
          "node3" : "SLAVE"
        },
        "MyDataStore_1" : {
          "node1" : "SLAVE",
          "node2" : "OFFLINE",
          "node3" : "MASTER"
        }
      }
    }

Orchestrate the sequence
• Use the Helix messaging API to replay the events

1. Node1:MyDataStore_0: Master-Slave
2. Node1:HARD KILL
3. Node2:START

Page 22: Data driven testing: Case study with Apache Helix

Constraint violation

Time   State    Number of slaves  Instance
42632  OFFLINE  0                 10.117.58.247_12918
42796  SLAVE    1                 10.117.58.247_12918
43124  OFFLINE  1                 10.202.187.155_12918
43131  OFFLINE  1                 10.220.225.153_12918
43275  SLAVE    2                 10.220.225.153_12918
43323  SLAVE    3                 10.202.187.155_12918
85795  MASTER   2                 10.220.225.153_12918

Constraint: no more than R=2 slaves. It is violated at time 43323, when the slave count reaches 3.

Page 23: Data driven testing: Case study with Apache Helix

How long was it out of whack?

Number of slaves  Time       Percentage
0                 1082319    0.5
1                 35578388   16.46
2                 179417802  82.99
3                 118863     0.05

Number of masters  Time       Percentage
0                  15490456   7.164960359
1                  200706916  92.83503964

83% of the time, there were 2 slaves to a partition.
93% of the time, there was 1 master to a partition.
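Each percentage is the time spent at a given replica count divided by the total observed time; for instance, 179417802 out of a total of 216197372 gives the 82.99 shown for two slaves. A minimal Python sketch of that calculation follows, assuming a time-ordered list of `(time, count)` samples in which each count holds until the next sample; `time_per_count` is a hypothetical name.

```python
from collections import defaultdict


def time_per_count(samples, end_time):
    """Return {count: percentage of elapsed time spent at that count}.

    samples is a time-ordered list of (time, count) pairs. Each count is
    assumed to hold until the next sample, and the last one until end_time.
    """
    durations = defaultdict(int)
    following = samples[1:] + [(end_time, None)]
    for (start, count), (stop, _next_count) in zip(samples, following):
        durations[count] += stop - start
    total = end_time - samples[0][0]
    return {count: 100.0 * spent / total for count, spent in durations.items()}
```

The same function works for masters or slaves; only the count series that is fed in differs.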

Page 24: Data driven testing: Case study with Apache Helix

Invariant 2: State transitions

FROM     TO       COUNT
MASTER   SLAVE    55
OFFLINE  DROPPED  0
OFFLINE  SLAVE    298
SLAVE    MASTER   155
SLAVE    OFFLINE  0
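A table like this can be produced by counting consecutive state pairs per replica and comparing them with the allowed set. The Python sketch below is an assumption-laden illustration: `LEGAL` simply lists the FROM/TO pairs that appear in the table, and `count_transitions` and `illegal_transitions` are hypothetical names.

```python
from collections import Counter

# The FROM/TO pairs listed in the table, treated here as the allowed set.
LEGAL = {
    ("MASTER", "SLAVE"),
    ("OFFLINE", "DROPPED"),
    ("OFFLINE", "SLAVE"),
    ("SLAVE", "MASTER"),
    ("SLAVE", "OFFLINE"),
}


def count_transitions(states_by_replica):
    """Count (from, to) pairs over each replica's time-ordered list of states."""
    counts = Counter()
    for states in states_by_replica.values():
        for prev, nxt in zip(states, states[1:]):
            if prev != nxt:  # repeated snapshots of one state are not transitions
                counts[(prev, nxt)] += 1
    return counts


def illegal_transitions(counts):
    """Return the observed transitions that are not in the allowed set."""
    return {pair: n for pair, n in counts.items() if pair not in LEGAL}
```

A replica that jumps straight from OFFLINE to MASTER would show up in `illegal_transitions`, which is the kind of violation this invariant is meant to catch.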

Page 25: Data driven testing: Case study with Apache Helix

Fun facts

• For almost a month, the test did not run successfully for even a single night
• Most issues were found using one test case
• Reproduced almost all failures

Page 26: Data driven testing: Case study with Apache Helix

Conclusion

• Traditional approach is not good enough
• Data driven testing is the way to go
  – Focus on workload and analysis
  – Production system always in test mode
  – Leverage tools built for testing to debug production issues

Page 27: Data driven testing: Case study with Apache Helix


website   helix.incubator.apache.org  

users   [email protected]  

dev   [email protected]  

twitter   @apachehelix