webinar: victorops blameless post-mortems

44
#VOwebinar @GoVictorOps (Blameless) post-mortems @jasonhand

Upload: victorops

Post on 19-Dec-2014

135 views

Category:

Internet


2 download

DESCRIPTION

- Real-life stories of blameless post-mortems in action - An understanding of why blaming members of a team for an outage or issue is counter productive - Actionable steps to help you perform post-mortems in a blameless manner.

TRANSCRIPT

  • 1. (Blameless)post-mortems@GoVictorOps #VOwebinar@jasonhand

2. Jason [email protected]@jasonhand@GoVictorOps #VOwebinar@jasonhand 3. Tara CalihmanSocial MediaMarketing Director@GoVictorOps #VOwebinar@Tarable 4. A little about meDir. of Platform Support - AppDirectDir. of Technical Support - StandingCloudDir. of Operational Systems - AFI SupplyHiker, climber, brewer, runner, biker, boarder, surfer,painter, singer, reader, writer, picker, coder, racer,camper, volunteer . all the usual Colorado 1-upper@GoVictorOps #VOwebinar@jasonhand 5. Alternative namesAlso known as: (Note: Public & Internal)Project RetrospectivesPost-mortem analysis Post-project reviewQuality Improvement ReviewProject Analysis ReviewAfter Action ReviewAutopsy ReviewSantayana ReviewTouchdown Meeting@GoVictorOps #VOwebinar@jasonhandLearning Review 6. Post-mortemDefinedWhat ?A process intended to inform improvements bydetermining aspects that were successful orunsuccessful.@GoVictorOps #VOwebinar@jasonhand 7. Post-mortemDefinedWhen ?As soon as feasible after the Incident is resolved.@GoVictorOps #VOwebinar@jasonhand 8. Post-mortemPollWho should be involved in a blameless post-mortem?A. ManagementB. The Dev & Ops teamsC. Only those that played a part in the outage &resolutionD. All of the above@GoVictorOps #VOwebinar@jasonhand 9. Post-mortemAnswerD. All of theabovei.e. EverybodyWho ?@GoVictorOps #VOwebinar@jasonhand 10. Post-mortemDefinedTo communicate with your teamWhy ?To understand what happened for learning andimproving@GoVictorOps #VOwebinar@jasonhand 11. Post-mortemDefinedTalk about the incident timelineEscalation stepsWhat was done to resolve theproblemCreate a remediation planMake it availableHow ?@GoVictorOps #VOwebinar@jasonhand 12. The Three RsRegretAcknowledgement and apologyReasonInitial incident detection to resolution,including the so-called root causes.RemedyActionable remediation itemsDave ZwiebackVP Engineering - Next Big Sound@GoVictorOps #VOwebinar@jasonhand( simple format ) 13. Moving from Reaction to Action(Remedy)Use SMART recommendationsSpecificMeasurableAgreed Upon/AgreeableRealisticTimebound@GoVictorOps #VOwebinar@jasonhand 14. Blameless@GoVictorOps #VOwebinarimage from Across the Universe#VOwebinar 15. Cool story, bro2011 - Hired to Standing CloudCloud marketplace & automated deployment ofappsBuild Support teamProvide Managed services@GoVictorOps #VOwebinar@jasonhand 16. Sydney DekkerReprimanding bad apples mayseem like a quick and rewardingfix, but its like peeing in yourpants.You feel relieved and perhaps evennice and warm for a little while,but then it gets cold anduncomfortable.And you look like a fool@GoVictorOps #VOwebinar@jasonhand 17. What is a blamelesspost-mortem?Team members are accountable but not responsibleComplete TransparencyDeeper look at circumstancesWhat happened and how to improve it (specificdetails)Real conditions of failure in complex systemsAvoid counterfactuals@GoVictorOps #VOwebinar@jasonhand 18. Your organization must continually affirmthat individuals are NEVER the rootcause of outages. Dave Zwieback@GoVictorOps #VOwebinar@jasonhand 19. WhyvsHow@GoVictorOps #VOwebinar 20. Paraphrased from Fallible Humans by Ian Malpass- DevOpsDays - Minneapolissource: http://www.indecorous.com/fallible_humans/@GoVictorOps #VOwebinar@jasonhand 21. ETTO(Efficiency Thoroughness Trade Off)The trade off between:being efficientvsbeing thoroughEfficientThorough@GoVictorOps #VOwebinar@jasonhand 22. We can be thorough and really dig into thetask at hand and understand it well but thistakes time:It is inefficient.- Ian Malpass@GoVictorOps #VOwebinar@jasonhand 23. Cause & Effectsource: http://xkcd.comThere are many factors that played a part in theproblemmay be@GoVictorOps #VOwebinar@jasonhand 24. How many times does the letter Fappear in the following sentence?Finished files are the re-sult@GoVictorOps #VOwebinar@jasonhandof years of scientificstudy combined with theexperience of many years. 25. How many times do you see the letter F@GoVictorOps #VOwebinar@jasonhandA. 3B. 4C. 5D. 6 26. @GoVictorOps #VOwebinar@jasonhandAnswer:6Cognitive Bias 27. Stress& Cognitive Bias@GoVictorOps #VOwebinar@jasonhand 28. @GoVictorOps #VOwebinar@jasonhandIs stress good or bad?A.GoodB.Bad 29. Yerkes-Dodson Model@GoVictorOps #VOwebinarsource: The Human Side of Postmortems@jasonhand 30. @GoVictorOps #VOwebinar@jasonhandIs stress good or bad?Answer:Both 31. Reduce Stress? buildmuscle memorySimulate many types ofproblems and outages aspractice @GoVictorOps #VOwebinar@jasonhand#VOwebinar 32. Evaluative ThreatBeing negativelyjudged plays a big rolein stress@GoVictorOps #VOwebinar@jasonhand#VOwebinar 33. What is stress surface?Variables of a situationNovel or unusualUnpredictableControllable situationNegative judgementRelationshipsHealthProblems at homeLack of sleep@GoVictorOps #VOwebinar@jasonhandEvaluative threatsALSOEtc 34. Capturing theHuman-sideAsk questions@GoVictorOps #VOwebinar@jasonhand 35. Stress Questionnaire0 = Never 1 = Almost Never 2 = Sometimes3 = Fairly Often 4 = Very OftenDuring the outage, how often have you felt or thoughtthat:The situation was novel or unusual?The situation was unpredictable?You were unable to control the situation?Others could judge your actions negatively?@GoVictorOps #VOwebinar@jasonhand 36. Why we DONT punishDe-incentivized to give the detailsPractically guarantees a repeat of the problemUnderstand why actions made sense (at the time)Create safety AND accountabilityMove away from idea of individuals are problemsCreate new experts@GoVictorOps #VOwebinar@jasonhand 37. @GoVictorOps #VOwebinar@jasonhand#VOwebinar 38. Promoting from withinWhere do we start?The basics: Document your timeline or log data Document conversations Leave room for notes Mean time to resolution / Time calculations Level of severity Archive it for historical retrieval Remediation. Make it actionable@GoVictorOps #VOwebinar@jasonhand 39. ToolsEtsys MorgueVictorOpsPost-mortem Report@GoVictorOps #VOwebinar@jasonhandInternal Wiki 40. VictorOpsPost-MortemReportEtsysMorgueTool@GoVictorOps #VOwebinar 41. Seek the truthDont blame others Dont blame yourselfThank You@GoVictorOps #VOwebinar@jasonhand 42. Questions ?@GoVictorOps #VOwebinar@jasonhand 43. Next Webinar:ChatOps Unpluggedhttp://victorops.com/chatops-webinarTry VictorOps!@GoVictorOps #VOwebinar@jasonhandNow What?See for yourself how we're solvingthe problem of post-mortemreporting, on-call scheduling, alertmanagement and remotecollaboration. Start your free trialor join our weekly product demoat www.victorops.comPost-mortem Guides 44. ResourcesThe Human Side of Postmortems - Dave ZwiebackThe Field Guide to Understanding Human Error - Sydney DekkerA Look at Looking in the Mirror - J. Paul ReedFallible Humans - Ian Malpass (http://www.indecorous.com/fallible_humans/)4 Questions to ask for an effective Technical Post Mortem - Jeffrey OBrien(http://www.maintenanceassistant.com/blog/4-questions-effective-technical-post-mortem/)Nine steps to IT post-mortem excellence - Michael Krigsman (http://www.zdnet.com/blog/projectfailures/nine-steps-to-it-post-mortem-excellence/1069)Postmortem reviews: purpose and approaches in software engineering - Torgeir Dingsyr(http://www.uio.no/studier/emner/matnat/ifi/INF5180/v10/undervisningsmateriale/reading-materials/p08/post-mortems.pdf)Blameless PostMortems and a Just Culture - John Allspaw (http://codeascraft.com/2012/05/22/blameless-postmortems/)What blameless really means - Jessica Harllee (http://www.jessicaharllee.com/notes/what-blameless-really-means/)Each necessary, but only jointly sufficient - John Allspaw (http://www.kitchensoap.com/2012/02/10/each-necessary-@GoVictorOps but-only-jointly-sufficient/)#VOwebinar@jasonhand