The Incident Lifecycle at New Relic. Step 1: Don’t Panic


Upload: new-relic

Post on 18-Feb-2017


TRANSCRIPT

The Incident Lifecycle at New Relic
Step 1: Don’t Panic
Nate Heinrich, Product Manager

©2008-15 New Relic, Inc. All rights reserved.  

This document and the information herein (including any information that may be incorporated by reference) is provided for informational purposes only and should not be construed as an offer, commitment, promise or obligation on behalf of New Relic, Inc. (“New Relic”) to sell securities or deliver any product, material, code, functionality, or other feature. Any information provided hereby is proprietary to New Relic and may not be replicated or disclosed without New Relic’s express written permission.

Such information may contain forward-looking statements within the meaning of federal securities laws. Any statement that is not a historical fact or refers to expectations, projections, future plans, objectives, estimates, goals, or other characterizations of future events is a forward-looking statement. These forward-looking statements can often be identified as such because the context of the statement will include words such as “believes,” “anticipates,” “expects” or words of similar import.

Actual results may differ materially from those expressed in these forward-looking statements, which speak only as of the date hereof, and are subject to change at any time without notice. Existing and prospective investors, customers and other third parties transacting business with New Relic are cautioned not to place undue reliance on this forward-looking information. The achievement or success of the matters covered by such forward-looking statements are based on New Relic’s current assumptions, expectations, and beliefs and are subject to substantial risks, uncertainties, assumptions, and changes in circumstances that may cause the actual results, performance, or achievements to differ materially from those expressed or implied in any forward-looking statement. Further information on factors that could affect such forward-looking statements is included in the filings we make with the SEC from time to time.
Copies of these documents may be obtained by visiting New Relic’s Investor Relations website at ir.newrelic.com or the SEC’s website at www.sec.gov. New Relic assumes no obligation and does not intend to update these forward-looking statements, except as required by law. New Relic makes no warranties, expressed or implied, in this document or otherwise, with respect to the information provided.


Mid-incident

Observing Amazing


Experience with un-amazing


Background: Web Development and IT Operations

Product Manager: Focus on New Relic’s operations capabilities, especially Alerts

Common un-amazing conversations

Conversation: Sys admins

▪ “Own” 24 apps they didn’t write
▪ Primary on-call
▪ Restart app
▪ Find authors or equivalent, create a phone bridge…

Conversation: “I’m a software company?”

▪ Technology is a cost center
▪ Speed (to deploy) and performance are undervalued
▪ Monitoring is a luxury

Three areas of investment to achieve an awesome incident lifecycle

Incident timeline

[Timeline diagram]
Stages: Root cause, Hits production, Detect, Escalate, Mitigate, Resolve, Action items, Retrospect
Spans: Incident duration; Detect, escalate; Resolve

Three areas of investment

[Timeline diagram]
Stages: Root cause, Hits production, Detect, Escalate, Mitigate, Resolve, Action items, Retrospect
Phases: Pre-incident, Incident, Post-incident
Areas of investment: Culture, Routines, Priority
Spans: Incident duration; Detect, escalate; Resolve


Three areas of investment

[Timeline diagram]
Stages: Root cause, Hits production, Detect, Escalate, Mitigate, Resolve, Action items, Retrospect
Phases: Pre-incident, Incident, Post-incident
Areas of investment: Routines, Priority, Culture
Spans: Incident duration; Detect, escalate; Resolve

Tangent: This is probably related…

Intentional: Software does what you tell it to
Immoral: Amoral servers aren’t running your code
Instantaneous: Slow degradation doesn’t get attention, only alerts affecting you now
Imminent: Works in production today, possible it will work tomorrow

Culture: Availability Medal Progression


Not all engineers have the same HA experience.

HA engineering is not trivial and can be difficult to approach.

Availability Medal Progression

Level 1 (bronze): Know Where You Are
Level 2 (silver): Keep Your Software Running
Level 3 (gold): Risks Are Fixed
Level 4 (platinum): Improve Availability Programmatically

Level 1 (bronze): Know Where You Are

▪ Basic monitoring
▪ Documented risk matrix

[Risk matrix: Impact (High, Medium, Low) plotted against Likelihood (Low, Medium, High), with an X marking each documented risk]
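A documented risk matrix like the one on this slide can be kept as data rather than as a drawing. The following is a minimal sketch, not New Relic's tooling: the `Risk` type, the `build_matrix` and `render` helpers, and the sample risks are all hypothetical, and only the Low/Medium/High scales for likelihood and impact come from the slide.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three levels shown on both axes of the slide's matrix.
LEVELS = ("Low", "Medium", "High")


@dataclass(frozen=True)
class Risk:
    name: str        # hypothetical free-text description of the risk
    likelihood: str  # one of LEVELS
    impact: str      # one of LEVELS


def build_matrix(risks):
    """Group risk names by their (impact, likelihood) cell."""
    matrix = defaultdict(list)
    for risk in risks:
        if risk.likelihood not in LEVELS or risk.impact not in LEVELS:
            raise ValueError(f"unknown level in {risk!r}")
        matrix[(risk.impact, risk.likelihood)].append(risk.name)
    return matrix


def render(matrix):
    """Render the matrix as text: one X per risk, highest impact row first."""
    lines = ["Impact / Likelihood | " + " | ".join(LEVELS)]
    for impact in reversed(LEVELS):
        cells = [
            "X" * len(matrix.get((impact, likelihood), [])) or "-"
            for likelihood in LEVELS
        ]
        lines.append(f"{impact} | " + " | ".join(cells))
    return "\n".join(lines)


if __name__ == "__main__":
    # Sample risks are invented for illustration only.
    sample = [
        Risk("single database primary", likelihood="High", impact="High"),
        Risk("expired TLS certificate", likelihood="Low", impact="High"),
        Risk("noisy log volume", likelihood="Low", impact="Low"),
    ]
    print(render(build_matrix(sample)))
```

Keeping the matrix as data makes it straightforward to aggregate across teams, which is one of the outcomes the deck lists for the medal progression.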

Level 2 (silver): Keep Your Software Running

▪ Build a culture where service status is widely known
▪ Advanced monitoring (observe issues early)
▪ Engage early
▪ Actionable data

Level 3 (gold): Risks Are Fixed

▪ Zero “high highs”
▪ Recurring gamedays
▪ Upstream and downstream impacts known
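The “zero high highs” criterion can be checked mechanically once risks are recorded as data. This sketch assumes that a “high high” means a risk rated High for both likelihood and impact, which is an interpretation of the slide rather than something it states. The dictionary keys and the sample risks are hypothetical.

```python
def high_highs(risks):
    """Return the names of risks rated High for both likelihood and impact.

    Assumption: this is what the slide means by a "high high".
    """
    return [
        r["name"]
        for r in risks
        if r["likelihood"] == "High" and r["impact"] == "High"
    ]


def meets_zero_high_highs(risks):
    """True when no risk sits in the High/High cell of the matrix."""
    return len(high_highs(risks)) == 0


if __name__ == "__main__":
    # Invented sample data for illustration.
    risks = [
        {"name": "single database primary", "likelihood": "High", "impact": "High"},
        {"name": "stale cache after deploy", "likelihood": "Medium", "impact": "Low"},
    ]
    print(high_highs(risks))
    print(meets_zero_high_highs(risks))
```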

Level 4 (platinum): Improve Availability Programmatically

▪ Programmatic mitigation
– Auto-scaling
– Auto app instance killing
– Retries & circuit breakers
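To make “retries & circuit breakers” concrete, here is a minimal sketch of the two patterns working together. It is a generic illustration, not code from New Relic: the class and function names, the thresholds, and the backoff values are all assumptions chosen for the example.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_after = reset_after              # cool-down in seconds
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single further failure will reopen the circuit.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # any success closes the circuit fully
        return result


def call_with_retries(breaker, func, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry func through the breaker with exponential backoff."""
    for attempt in range(attempts):
        try:
            return breaker.call(func)
        except CircuitOpenError:
            raise  # do not hammer a dependency the breaker has given up on
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

The retry loop handles brief, transient failures, while the breaker stops callers from piling onto a dependency that keeps failing. Injecting `clock` and `sleep` keeps the sketch testable without real waiting.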

Outcome

Clear path for teams
Assistance along the way
Teams know where they stand
Aggregation across teams helps management

Routines: Nrrd Incident Orchestration


Communication frequency and consistency.

Clear roles and “torch passing”.

Moar ChatOps!
– Nrrd (Hubot)
– HipChat
– Incident / Retro tracking tool (internal)


Nrrd

Democratize incident creation
Managed role assignment
Timed status updates
Statuses saved as incident log for retros
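The Nrrd capabilities listed on this slide (anyone can open an incident, roles are assigned and handed over, status updates are timed, and statuses become the retro log) can be modeled in a few lines. This is a hypothetical sketch of the data involved, not Nrrd's implementation or command set: the `Incident` class, the "incident commander" role name, and the 15-minute update interval are all assumptions.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Incident:
    title: str
    opened_by: str        # anyone may open an incident
    opened_at: datetime
    update_interval: timedelta = timedelta(minutes=15)  # assumed cadence
    roles: dict = field(default_factory=dict)
    log: list = field(default_factory=list)
    last_status_at: Optional[datetime] = None

    def assign(self, role, person, at):
        """Assign or hand over a role, recording the "torch passing" in the log."""
        previous = self.roles.get(role)
        self.roles[role] = person
        note = f"{role}: {previous} -> {person}" if previous else f"{role}: {person}"
        self.log.append((at, person, note))

    def post_status(self, author, text, at):
        """Record a status update and restart the update timer."""
        self.log.append((at, author, text))
        self.last_status_at = at

    def update_overdue(self, now):
        """True when no status has been posted within the update interval."""
        last = self.last_status_at or self.opened_at
        return now - last >= self.update_interval

    def retro_log(self):
        """Return the log as timestamped lines, ready for the retrospective."""
        return [f"{at:%H:%M} {author}: {text}" for at, author, text in self.log]
```

A chat bot built on a model like this could poll `update_overdue` and prompt the current role holder, which addresses the communication frequency and consistency problem raised earlier in the deck.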

Priorities: DRI Policy & Unified Work Stream


Reliability PM & Unified Work Stream


Retro items all had the same priority.

Larger availability initiatives can’t compete.


Don’t Repeat Incidents Policy

Immediate Retro Actions

Longer Term Holistic Actions

First merges to master post-incident

Tracked in the same place as feature work

Unified Work Stream


Just scratching the surface


Things I didn’t talk about…

Code, server and app ownership
Disaster recovery exercises
The written culture
Chaos nerds
Hiring & Incentives
GameDays
Deep dive on tooling
Security and availability
Side-kicking

Final thought

Culture, Routines, Priorities

Investing in your culture, your routines, and how you prioritize is essential

Awesome!


Thank you.

Nate [email protected]