a crash course in building site reliability
TRANSCRIPT
![Page 1: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/1.jpg)
Building Site Reliability
Engineering:
A Crash Course
Amin Astaneh, Acquia Inc.
![Page 2: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/2.jpg)
Who am I?
● Senior Manager, SRE at Acquia
● Was in Operations Team from Dec
2010 - Nov 2015
● Built and Lead the Site Reliability
Engineering Team
![Page 3: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/3.jpg)
Agenda
● What is SRE?
● Why Do SRE?
● Acquia, Pre-SRE
● How Acquia Does SRE
● Building an SRE Competency
● How to Hire SREs?
● 1-Year Retrospective
![Page 4: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/4.jpg)
What is SRE?
![Page 5: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/5.jpg)
What is SRE?
“What happens when a software engineer is tasked with what used to be called
operations.”
- Ben Treynor, Google
![Page 6: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/6.jpg)
What is SRE?
SRE takes the manual processes associated with Operations..
![Page 7: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/7.jpg)
What is SRE?
..and replaces them with automation using software engineering.
![Page 8: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/8.jpg)
What is SRE?
SREs also use a set of methodologies and best practices that help engineering
teams create a mature and sustainable process for service ownership.
![Page 9: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/9.jpg)
How Does This Relate to DevOps?
DevOps is a set of values, tools, and processes that allow teams to best deliver
value to the customer.
Therefore, SRE can be considered a specific implementation of DevOps.
![Page 10: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/10.jpg)
SRE Practices (according to Google)
![Page 11: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/11.jpg)
1) Hire only coders.
![Page 12: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/12.jpg)
2) Have SLO(s) for your service.
![Page 13: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/13.jpg)
What are SLOs?
● SLIs: Service Level Indicators (What to Measure)
● SLOs: Service Level Objectives (Targets for Measurements)
● SLAs: Service Level Agreements (Consequences for Missing Targets)
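The relationship between an SLI and an SLO can be sketched in a few lines of Python. This is a minimal illustration, not anything shown in the talk: the `Slo` class, the `availability_sli` and `meets_slo` helpers, and the 99.9% target are all hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A target for one indicator (name and target are illustrative)."""
    name: str
    target: float  # 0.999 means 99.9% of requests must succeed


def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: the measured fraction of requests served successfully."""
    if total_requests == 0:
        return 1.0  # no traffic, so nothing failed
    return good_requests / total_requests


def meets_slo(sli: float, slo: Slo) -> bool:
    """SLO check: is the measured indicator at or above the target?"""
    return sli >= slo.target


# Hypothetical measurement window: 1,000,000 requests, 500 of them failed.
slo = Slo(name="availability", target=0.999)
sli = availability_sli(good_requests=999_500, total_requests=1_000_000)
print(f"SLI={sli:.4%}, meets SLO: {meets_slo(sli, slo)}")
```

An SLA would sit on top of this as a contractual consequence of missing the target, which is a business agreement rather than something to compute.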
![Page 14: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/14.jpg)
3) Measure and report performance
against the SLO(s).
![Page 15: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/15.jpg)
4) Use Error Budgets and gate launches
on them.
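As a rough sketch of how an error budget can gate launches: the budget is the share of failures the SLO tolerates, and launches stop once it is spent. The function names, the request counts, and the 99.9% target below are assumptions made for illustration, not figures from the talk.

```python
def error_budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent; negative means overspent."""
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures <= 0:
        return 0.0  # a 100% target (or no traffic) leaves no budget to spend
    actual_failures = total - good
    return 1.0 - actual_failures / allowed_failures


def launch_allowed(slo_target: float, good: int, total: int) -> bool:
    """Gate: permit a launch only while some error budget remains."""
    return error_budget_remaining(slo_target, good, total) > 0.0


# Hypothetical window: 1,000,000 requests against a 99.9% SLO,
# so roughly 1,000 failed requests are tolerated; 400 have occurred.
remaining = error_budget_remaining(0.999, good=999_600, total=1_000_000)
print(f"budget remaining: {remaining:.0%}, "
      f"launch allowed: {launch_allowed(0.999, 999_600, 1_000_000)}")
```

In practice the window (for example, a rolling 30 days) and what counts as a "good" request are decisions each team makes when it defines its SLOs.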
![Page 16: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/16.jpg)
5) Have a common staffing pool for SRE
and developers.
![Page 17: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/17.jpg)
6) Cap SRE operational load at 50%.
![Page 18: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/18.jpg)
7) Have excess Ops work overflow to the
Dev Team.
![Page 19: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/19.jpg)
8) Share 5% of Ops work with the Dev
Team.
![Page 20: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/20.jpg)
9) Oncall teams should have at least
eight people at one location, or six people
at each of multiple locations.
![Page 21: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/21.jpg)
10) Aim for a maximum of two events per
oncall shift.
![Page 22: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/22.jpg)
11) Do a postmortem for every event.
![Page 23: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/23.jpg)
12) Postmortems are blameless and
focus on process and technology, not
people.
![Page 24: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/24.jpg)
Why Do SRE?
![Page 25: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/25.jpg)
Scale
![Page 26: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/26.jpg)
Improve Employees’ Quality of Life
![Page 27: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/27.jpg)
REDUCE COST
![Page 28: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/28.jpg)
Acquia, Pre-SRE
![Page 29: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/29.jpg)
![Page 30: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/30.jpg)
![Page 31: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/31.jpg)
![Page 32: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/32.jpg)
![Page 33: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/33.jpg)
Things We Tried First
● Implemented Kanban for Ops to make work visible and maximize throughput
● Did ‘Tier 2 Sprints’ to build automation for the team
● Generated team metrics to influence decision-making
“People Metrics: How to Use Team Data to Produce Positive Change”
https://events.drupal.org/dublin2016/sessions/people-metrics
![Page 34: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/34.jpg)
How Acquia Does SRE
![Page 35: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/35.jpg)
How Acquia Does SRE
Acquia SRE was commissioned as the driving force of our DevOps Initiative,
which has the following core values:
● Eliminate Toil
● No Capes
● Deliver With Empathy
● Own Your Service
● Own Your Business
● Own Customer Success
![Page 36: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/36.jpg)
Acquia SRE vs Google SRE
● We embed engineers on teams, rather than build teams that run services on
behalf of engineers
● The entire engineering team (plus the SRE) is expected to ‘own their service’,
with the SRE providing leadership on how to best handle those
responsibilities
● The SRE identifies risk as part of their day-to-day and brings improvement
opportunities directly to the Product Manager for prioritization
![Page 37: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/37.jpg)
Acquia SRE vs Google SRE
● We evaluate with Engineering and Product what the most critical projects are
on a quarterly basis, and allocate the team to best meet the present need
● We still reserve the right to remove engineers if an engagement becomes
untenable, though it has not yet been necessary
● We have a heavy focus on time tracking to aid in toil reduction
![Page 38: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/38.jpg)
8) Share 5% of Ops work with the Dev
Team.
![Page 39: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/39.jpg)
![Page 40: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/40.jpg)
8) Ops work IS the responsibility of the
Dev Team.
![Page 41: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/41.jpg)
![Page 42: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/42.jpg)
![Page 43: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/43.jpg)
![Page 44: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/44.jpg)
![Page 45: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/45.jpg)
Building an SRE Competency
![Page 46: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/46.jpg)
Get Management Buy-In
![Page 47: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/47.jpg)
SRE Won’t Work Without Two Things
● Authority to stop releases when the error budget has been
exhausted
● Authority to overflow operational work to the dev team
when operational load > 50%
This authority must be granted by the leaders of the engineering/product efforts.
DO NOT CONTINUE UNLESS YOU HAVE THESE!
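The 50% threshold only works if operational load is actually measured. A minimal sketch, assuming time-tracking entries of the form `(engineer, category, hours)` where the category is either `"ops"` or `"project"`; that data shape, the function names, and the sample hours are all hypothetical.

```python
from collections import defaultdict

OPERATIONAL_LOAD_CAP = 0.50  # the 50% cap on SRE operational load


def operational_load(entries):
    """Return {engineer: fraction of tracked hours spent on ops work}."""
    ops_hours = defaultdict(float)
    total_hours = defaultdict(float)
    for engineer, category, hours in entries:
        total_hours[engineer] += hours
        if category == "ops":
            ops_hours[engineer] += hours
    return {e: ops_hours[e] / total_hours[e]
            for e in total_hours if total_hours[e] > 0}


def over_cap(load_by_engineer):
    """Engineers whose ops load exceeds the cap and should overflow work."""
    return sorted(e for e, load in load_by_engineer.items()
                  if load > OPERATIONAL_LOAD_CAP)


# Hypothetical week of time tracking.
week = [
    ("ana", "ops", 24), ("ana", "project", 16),
    ("ben", "ops", 10), ("ben", "project", 30),
]
load = operational_load(week)
print(load, "overflow to dev team for:", over_cap(load))
```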
![Page 48: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/48.jpg)
How Do You Get Buy-In?
![Page 49: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/49.jpg)
Establish a Sense of Urgency!
https://events.drupal.org/baltimore2017/sessions/%C2%A1viva-la-revoluci%C3%B3n-how-start-devops-transformation-your-workplace
![Page 50: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/50.jpg)
Automatically Measure Toil
![Page 51: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/51.jpg)
SRE Operational Load Dashboard
![Page 52: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/52.jpg)
Operational Responsibility Assessment
![Page 53: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/53.jpg)
Operational Responsibility Assessment
● Based on the Capability Maturity Model (https://en.wikipedia.org/wiki/Capability_Maturity_Model)
● Evaluates the following responsibilities:
○ Routine Tasks
○ Emergency Response
○ Monitoring and Metrics
○ Capacity Planning
○ Change Management
○ New Product Introduction and Removal
○ Service Deploy and Decommissioning
○ Performance and Efficiency
○ Information Security
![Page 54: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/54.jpg)
Operational Responsibility Assessment
Each responsibility is scored from 1-5:
1. Initial: Chaotic. Undocumented, ad-hoc, and require individual heroics.
2. Repeatable: Documented sufficiently so they can be repeated with the same
results.
3. Defined: Roles and responsibilities for the process are defined and
confirmed.
4. Managed: The process is quantitatively managed in accordance with
agreed-upon metrics.
5. Optimizing: Process management includes deliberate process
optimization/improvement.
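One way to record an assessment is a score per responsibility on the 1-5 scale, from which the weakest areas can be pulled out as candidates for improvement work. This is a sketch under assumptions: the `weakest_areas` helper, the threshold of "Defined", and the sample scores are invented for illustration and are not Acquia's actual tooling or results.

```python
from enum import IntEnum


class Maturity(IntEnum):
    INITIAL = 1
    REPEATABLE = 2
    DEFINED = 3
    MANAGED = 4
    OPTIMIZING = 5


RESPONSIBILITIES = (
    "Routine Tasks",
    "Emergency Response",
    "Monitoring and Metrics",
    "Capacity Planning",
    "Change Management",
    "New Product Introduction and Removal",
    "Service Deploy and Decommissioning",
    "Performance and Efficiency",
    "Information Security",
)


def weakest_areas(scores, threshold=Maturity.DEFINED):
    """Return responsibilities scored below the threshold, lowest first."""
    missing = set(RESPONSIBILITIES) - set(scores)
    if missing:
        raise ValueError(f"unscored responsibilities: {sorted(missing)}")
    below = [name for name in RESPONSIBILITIES if scores[name] < threshold]
    return sorted(below, key=lambda name: scores[name])


# Hypothetical assessment of one service.
scores = {name: Maturity.DEFINED for name in RESPONSIBILITIES}
scores["Emergency Response"] = Maturity.INITIAL
scores["Capacity Planning"] = Maturity.REPEATABLE
print(weakest_areas(scores))
```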
![Page 55: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/55.jpg)
![Page 56: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/56.jpg)
Operational Responsibility Assessment
● Assess your services often! (we suggest quarterly)
● Take findings/risks and create tasks for improvement
● Publish your results and share them with your organization
● Do not tie ORA results to KPIs, incentives, etc.
![Page 57: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/57.jpg)
READ APPENDIX A!
![Page 58: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/58.jpg)
Blameless Post Mortems
![Page 59: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/59.jpg)
Blameless Post Mortems
● Document timeline of the incident
● With the team, determine:
○ What went well
○ What didn’t go well (process failures, technical root cause)
○ What was lucky (or circumstantial)
● For each thing that didn’t go well or was circumstantial:
○ File an action item to address it
○ Make sure each item has clear acceptance criteria/requirements (grooming)
○ Make sure each item has a clear level of effort (sizing)
○ Prioritize in the backlog based on relative risk
● Openly share the post-mortem with the rest of the company
● Review with the team periodically
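The action-item steps above can be sketched as a small data structure: an item is ready for the backlog once it is groomed and sized, and ready items are ordered by risk. The talk does not say how relative risk is scored, so the likelihood-times-impact scale here is an assumption, as are the class and field names and the sample items.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    acceptance_criteria: str = ""        # filled in during grooming
    effort_points: Optional[int] = None  # filled in during sizing
    likelihood: int = 1                  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int = 1                      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def risk(self) -> int:
        """Assumed relative-risk score: likelihood times impact."""
        return self.likelihood * self.impact

    def is_ready(self) -> bool:
        """Ready means groomed (has criteria) and sized (has effort)."""
        return bool(self.acceptance_criteria.strip()) and self.effort_points is not None


def prioritize(items: List[ActionItem]) -> List[ActionItem]:
    """Return groomed and sized items, highest risk first."""
    ready = [item for item in items if item.is_ready()]
    return sorted(ready, key=lambda item: item.risk, reverse=True)


# Hypothetical action items from one post-mortem.
items = [
    ActionItem("Add disk-usage alert", "Alert fires at 80% usage", 2, likelihood=4, impact=3),
    ActionItem("Document failover runbook", "Runbook reviewed by on-call", 3, likelihood=2, impact=5),
    ActionItem("Investigate slow deploys"),  # not yet groomed or sized
]
print([item.title for item in prioritize(items)])
```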
![Page 60: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/60.jpg)
Launch Readiness Criteria
![Page 61: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/61.jpg)
What is Launch Readiness Criteria?
● A set of guidelines that represent the minimum standard of what a new
product launch requires from an operational standpoint
● Expressed in terms of the Operational Responsibility Assessment
● Intended to address the major forms of risk without introducing needless
roadblocks into the product launch process
● A living document that is continuously maintained and kept relevant
● Inspired by: https://landing.google.com/sre/book/chapters/reliable-product-launches.html
![Page 62: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/62.jpg)
Example LRC Checklist Items
![Page 63: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/63.jpg)
LRC Enablement
![Page 64: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/64.jpg)
Example Service Pages
![Page 65: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/65.jpg)
Example Service Dashboard
![Page 66: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/66.jpg)
Example Code
![Page 67: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/67.jpg)
Example Operational Runbooks
![Page 68: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/68.jpg)
Example Post Mortem/RCA Template
![Page 69: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/69.jpg)
Create an Onboarding Process
![Page 70: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/70.jpg)
Create an Onboarding Process
● Implement an Incident Response Process
○ On-Call Rotation
○ Documentation for stakeholders on how to get help
○ Fundamentals: production access credentials, runbooks
● Perform/Publish an Operational Responsibility Assessment
● Define/Publish Service Level Objectives
● Create Monitoring/Alerting against SLOs
● Create Dashboards For SLO performance and remaining error budget
![Page 71: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/71.jpg)
Weekly Office Hours
![Page 72: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/72.jpg)
How To Hire SREs?
![Page 73: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/73.jpg)
Hire Software Developers
![Page 74: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/74.jpg)
Hire Software Developers
![Page 75: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/75.jpg)
Hire Operations People
![Page 76: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/76.jpg)
Hire Operations People
![Page 77: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/77.jpg)
What Makes a Good SRE?
● It’s complicated
● You want someone with the ability to contribute to a software engineering
project..
● Yet is motivated by operational concerns and understands the subject matter
(Linux, TCP/IP, monitoring, performance, config management..)
● Is willing to be on-call
● Knowledge of agile practices as a method to suggest improvements
● ‘SRE Temperament’: can communicate their opinions on something in a way
that is persuasive and data-driven
![Page 78: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/78.jpg)
Selling Points for Prospective SREs
● Toil is capped at 50%, which means 50%+ project work at all times!
● Authority to stop flow of releases when service is too unreliable
● There is oncall, but responsibility is shared with the whole team
● Root causes of outages are tracked, prioritized, and addressed
These Create A Work Environment That Respects The SRE
![Page 79: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/79.jpg)
1-Year Retrospective
![Page 80: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/80.jpg)
What Went Well
![Page 81: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/81.jpg)
What Went Well
● Launch Readiness Criteria is now a corporate standard
● Teams are independently performing their own blameless post mortems
● Teams are independently performing their own ORAs
● SRE influenced a grassroots reorg of Cloud Engineering around SOA
● More and more teams are taking an active role in on-call responsibilities
● Weekly Office Hours has been an effective tool for sharing ideas
![Page 82: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/82.jpg)
What Didn’t Go Well
![Page 83: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/83.jpg)
What Didn’t Go Well
● We struggled with getting SLOs and error budgets established for all services
● We didn’t get Launch Readiness out the door fast enough for new services
![Page 84: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/84.jpg)
Current Improvements
![Page 85: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/85.jpg)
Current Improvements
● SRE engagements now require the onboarding process before any other
work can take place:
○ Establish Incident Response Process
○ Perform Operational Responsibility Assessment
○ Define Service Level Objectives
○ Establish Monitoring and Alerting Against SLOs
○ Create Dashboards Displaying SLOs and Error Budgets
● Operational Stories are required to be prioritized in proportion to the SRE
presence on an engineering team.
![Page 86: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/86.jpg)
“When we were in Ops, it was simple, because our purpose was to simply address the incident.
Our purpose now is to address the problems of the business.
We are the vehicle of change. That’s hard work, but we can do it.”
![Page 87: A Crash Course in Building Site Reliability](https://reader033.vdocuments.site/reader033/viewer/2022051710/5a649b1e7f8b9a2c568b6357/html5/thumbnails/87.jpg)
Questions?