Love DevOps? Wait ‘til you meet SRE!

Upload: atlassian

Post on 16-Apr-2017

Category:

Software


TRANSCRIPT

NICK WRIGHT • SRE MANAGER • ATLASSIAN

Love DevOps? Wait ‘til you meet SRE!

Agenda

SETTING THE SCENE

SRE AND HOW IT CAN HELP

GETTING STARTED

OPS TOOLCHAIN

10+ incidents per month

100+ incidents per month

400+ incidents per month

900+ incidents per month

Too much firefighting

(Image: Caters News Agency)

Fixing the same thing repeatedly

(Image: America’s Funniest Home Videos)

Job Satisfaction

(Image: NASA)

Service Ops

Application Development

Agenda

SETTING THE SCENE

SRE AND HOW IT CAN HELP

GETTING STARTED

OPS TOOLCHAIN

Site Reliability Engineering

Decentralised
Multiple distinct operations teams, or a You-Build-It, You-Run-It model.

Specialised
Engineers focus on a single service or group of related services.

Preventative
Primary focus: get away from break-fix, do work that prevents outages.

SRE vs DevOps?

SRE
• Operations
• Incident response
• Post Mortems
• Monitoring, Events, Alerting
• Capacity planning
• Primary focus: Reliability

DevOps
• Delivery
• Release automation
• Environment builds
• Config management
• Infrastructure as code
• Primary focus: Delivery Speed

Solutions

Balance interrupt vs preventative work

(Image: GravityGlue.com)

Hire Devs! And have a common hiring pool

Always do Post-Mortems

Scrap the release meeting!

Agenda

SETTING THE SCENE

SRE AND HOW IT CAN HELP

GETTING STARTED

OPS TOOLCHAIN

The journey to SRE

Vision
Define how the team will work and how we measure success

Build
Get the team up and running!

Improve
Revisit regularly - if it's not working, tweak, change, refine.

Vision

Goals and Metrics
In 6 months we will:
• Replace monitoring
• DR Plan and Test
How we measure success?
• Number of Incidents
• PIR Coverage

Responsibilities
• Service list
• Service Owners
• Team Duties

Team Structure
Size and structure of team
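
The two success measures named on the Vision slide (Number of Incidents and PIR Coverage) can be sketched as a small calculation. This is only an illustration: it assumes PIR is the Post Incident Review mentioned later in the deck and that coverage means the share of incidents with a completed review. The `Incident` record and its field names are hypothetical and do not come from the talk.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    key: str
    pir_completed: bool


def incident_count(incidents: list[Incident]) -> int:
    """Number of incidents in the reporting period."""
    return len(incidents)


def pir_coverage(incidents: list[Incident]) -> float:
    """Fraction of incidents that have a completed post-incident review.

    Returns 1.0 for an empty period (nothing is left unreviewed);
    that convention is an assumption, not something the talk states.
    """
    if not incidents:
        return 1.0
    reviewed = sum(1 for incident in incidents if incident.pir_completed)
    return reviewed / len(incidents)


# Made-up sample data for one reporting period
incidents = [
    Incident("INC-1", True),
    Incident("INC-2", True),
    Incident("INC-3", False),
    Incident("INC-4", True),
]
print(incident_count(incidents))         # 4
print(f"{pir_coverage(incidents):.0%}")  # 75%
```

Tracking both numbers together matters because a falling incident count with low review coverage says little about whether the same failures will recur.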

Team Structure

Developer Teams

SRE

Build

Hiring
Get the team in place
• Start Early!
• Promotion Opportunities
• Existing hiring pipeline

Tools
Set things up so they can work!
• Last part of the talk!

Training
• Bootcamps
• Wheel of Misfortune!

Improve

Regular check-ins

Review decisions

Change where needed

Blog success stories!

Does it work?!

100% Post Incident Review Completion Rate

DR Compliance

“The SRE team runs ahead of the rest of the team on reliability and encourages everyone to lift their game.”

ANDRE SERNA, DEV MANAGER

“In the past the separate ops and dev teams would often pick the solution they were best positioned to implement. I like that our SRE team is able to pick the best solution to the problem instead.”

JAMES BUNTON, DEV-ON-ROTATION

Agenda

SETTING THE SCENE

SRE AND HOW IT CAN HELP

GETTING STARTED

OPS TOOLCHAIN

Incident

Alerts

Dashboard

Incident Ticket

HOT room

Ops room

SREs

Atlassians

Ops JIRA

Confluence

Run Book

Ops room

Ops JIRA

JQL

Select Action

JIRA HipChat Discussions
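
The Ops room slide names JQL as the way to pull tickets out of Ops JIRA. A minimal sketch of how such a query string might be assembled is shown here. The project key `OPS` is hypothetical, the status names are borrowed from the incident ticket slides, and the actual queries used in the talk are not known. The code only builds a string and makes no calls to JIRA.

```python
def quote(value: str) -> str:
    """Wrap a value in double quotes, escaping backslashes and embedded quotes."""
    return '"' + value.replace("\\", "\\\\").replace('"', '\\"') + '"'


def open_incidents_jql(project: str, open_statuses: list[str]) -> str:
    """Build a JQL string selecting tickets in the given statuses, newest first."""
    statuses = ", ".join(quote(status) for status in open_statuses)
    return (
        f"project = {quote(project)} "
        f"AND status IN ({statuses}) "
        "ORDER BY created DESC"
    )


# "OPS" is a hypothetical project key, not taken from the talk
print(open_incidents_jql("OPS", ["Pending", "Fixing", "Reviewing"]))
# project = "OPS" AND status IN ("Pending", "Fixing", "Reviewing") ORDER BY created DESC
```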

Incident Ticket

HOT room

Pending

Fixing

Reviewing

Closed
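
The four incident ticket statuses on this slide (Pending, Fixing, Reviewing, Closed) can be modelled as a small workflow. The sketch below assumes a strictly linear flow in the order shown; the slide does not say whether tickets can move backwards or skip a step, so treat the transition rule as an assumption.

```python
from enum import Enum


class TicketStatus(Enum):
    """Status names as they appear on the slide."""
    PENDING = "Pending"
    FIXING = "Fixing"
    REVIEWING = "Reviewing"
    CLOSED = "Closed"


# Assumed linear order: each status leads only to the one after it
_ORDER = list(TicketStatus)


def next_status(current: TicketStatus) -> TicketStatus:
    """Return the status that follows `current`; Closed is terminal."""
    if current is TicketStatus.CLOSED:
        raise ValueError("Closed is the final status")
    return _ORDER[_ORDER.index(current) + 1]


status = TicketStatus.PENDING
while status is not TicketStatus.CLOSED:
    following = next_status(status)
    print(status.value, "->", following.value)
    status = following
```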

Incident Ticket

ALL

MOST

FEW

ONE

Minor Impact

Moderate Impact

Severe Impact

Outage

Incident Ticket

Fail

Detect

Respond

Fix

Close

JIRA ticket

Post Mortem
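
The incident stages named on this slide can be turned into simple timing measures for a post mortem. The sketch below assumes the stages run in the order Fail, Detect, Respond, Fix, Close, and that a timestamp is recorded for each one. The dictionary keys and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta

# Assumed stage order; the slide lists the stage names only
STAGES = ["fail", "detect", "respond", "fix", "close"]


def stage_durations(timestamps: dict[str, datetime]) -> dict[str, timedelta]:
    """Time elapsed between each consecutive pair of stages."""
    durations = {}
    for earlier, later in zip(STAGES, STAGES[1:]):
        delta = timestamps[later] - timestamps[earlier]
        if delta < timedelta(0):
            raise ValueError(f"{later} is recorded before {earlier}")
        durations[f"{earlier}_to_{later}"] = delta
    return durations


# Made-up timestamps for a single incident
incident = {
    "fail": datetime(2017, 4, 1, 9, 0),
    "detect": datetime(2017, 4, 1, 9, 5),
    "respond": datetime(2017, 4, 1, 9, 12),
    "fix": datetime(2017, 4, 1, 10, 0),
    "close": datetime(2017, 4, 3, 15, 0),
}
for name, delta in stage_durations(incident).items():
    print(name, delta)
```

Splitting the timeline this way separates slow detection (a monitoring problem) from slow fixing (a runbook or tooling problem).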

Incident Ticket

HOT room

Ops room

SREs

Ops JIRA

Confluence

JIRA

Actions!

Confluence

Confluence

Actions Linked Here

Incident Ticket

HOT room

Ops room

SREs

Ops JIRA

Confluence

JIRA

Actions!

Pending

Fixing

Reviewing

Closed

Draft

Approval

Published

Completed

JIRA

JIRA

Team 1

JIRA

Team 2

Team 3

Reporting

JIRA

Summary

atlassian.com/careers

atlassian.com/help-desk

Pedro Canahuati, “Scaling the Operations Organisation at Facebook”

Ben Treynor, “Keys to SRE”

Thank you!

NICK WRIGHT • SRE MANAGER • ATLASSIAN