![Page 1: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/1.jpg)
SRE
How to “SRE” an SRE Training Program
Deploying SRE Training Best Practices to Production
Jennifer Petoff (aka Dr. J)
Twitter: @jennski
Senior Program Manager and Global Lead, SRE EDU
JC van Winkel
Site Reliability Engineer and Lead Educator, SRE EDU
![Page 2: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/2.jpg)
Jennifer Petoff (aka Dr. J)
Google Ireland
● Ph.D. in Chemistry
● 12 years at Google
● Co-editor of the SRE Book
● Part-time Travel Blogger at Sidewalk Safari
![Page 3: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/3.jpg)
JC van Winkel
Google Switzerland
● 8 years at Google
● Was oncall for production monitoring at Google for 6 years
● 30 years experience in teaching
![Page 4: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/4.jpg)
Why is training important?
![Page 5: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/5.jpg)
![Page 6: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/6.jpg)
![Page 7: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/7.jpg)
SRE EDU: A Brief History
Timeline axis: 2003, then 2014 through 2019. Milestones, in order:
● Google SRE founded (2003)
● Grokking SRE the hard way (2003 to 2014)
● ‘Secret Alliance’ for SRE Education convenes
● SRE EDU Team formed
● v1 SRE EDU Orientation launched
● Going Oncall Curriculum launched
● SRE EDU Ongoing Education Week launched
● v2 SRE EDU Orientation launched
● Focus on Operations, Automation, Toil Reduction (SRE’ing our SRE Training Programs)
![Page 8: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/8.jpg)
Continuum of Training Options
Effort (low → high): “Sink or Swim” → Self-study curriculum → Buddy System → Ad hoc classes → Systematic Training Program
● Avoid “Sink or Swim” if you value inclusivity.
○ Breeds stress, frustration, attrition
○ Imposter syndrome
● For other options, consider the ROI on the effort invested
○ Are you a small or large organization?
○ Are you growing rapidly?
○ How experienced are the people you are trying to train?
![Page 9: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/9.jpg)
Is More Effort Always Better? No.
(Chart: results plotted against effort)
SRE Principle in Practice:
● Do just enough to meet the needs of your students.
● Keep them happy, but not too happy.
● Consider trade-offs and avoid polishing a diamond.
![Page 10: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/10.jpg)
Analogy Between Software Development and Training Programs

| | “What” | “How” |
| --- | --- | --- |
| Software Development | Product Features | Deploying to production in a reliable way to meet the needs of our users. |
| Training Program | Training Content | Deploying a consistent and reliable training program that meets the needs of our students. |

![Page 11: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/11.jpg)
Service Reliability Hierarchy (base to top)
Monitoring
Incident Response
Postmortem / RCA
Testing & Release
Capacity Planning
Development
Product
Source: https://landing.google.com/sre/sre-book/chapters/part3/
![Page 12: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/12.jpg)
How to Apply SRE Principles to a Training Program

| Service Reliability Hierarchy* | SRE Training Reliability Hierarchy |
| --- | --- |
| Monitoring | Attendance Tracking / Survey Feedback |
| Incident Response | Address Issues |
| Postmortem / RCA | Postmortem / RCA |
| Testing & Release | Test Teaching |
| Capacity Planning | Scale Operations |
| Development | Curriculum design |
| Product | Program |

* https://landing.google.com/sre/sre-book/chapters/part3/
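One way to read the two hierarchies on this slide is as a lookup from each service-reliability layer to its training counterpart. The sketch below is illustrative only: the pairing is inferred from the slide's side-by-side layout rather than stated in it, and the names `HIERARCHY` and `training_counterpart` are hypothetical.

```python
# Illustrative pairing of the Service Reliability Hierarchy (base first)
# with the training-program layers named on the slide.
# Assumption: the pairing is inferred from the slide layout, not quoted from it.
HIERARCHY = [
    ("Monitoring", "Attendance Tracking / Survey Feedback"),
    ("Incident Response", "Address Issues"),
    ("Postmortem / RCA", "Postmortem / RCA"),
    ("Testing & Release", "Test Teaching"),
    ("Capacity Planning", "Scale Operations"),
    ("Development", "Curriculum design"),
    ("Product", "Program"),
]


def training_counterpart(service_layer: str) -> str:
    """Return the training-program layer paired with a service layer."""
    for service, training in HIERARCHY:
        if service == service_layer:
            return training
    raise KeyError(service_layer)
```

For example, `training_counterpart("Monitoring")` returns the survey and attendance layer, which is where the student feedback quoted later in the deck would come from.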
![Page 20: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/20.jpg)
What Did Our Monitoring Tell Us?
● “Also more prepared, hands-on "Hello world" demonstrations and in-class labs allowing use of the aforementioned paths would be welcome (kinesthetic).”
● “More time doing hands-on work and deeper exploration of how {redacted} were run by SRE teams would be nice.”
● “Some more hands-on activities would have been good.”
● “I disliked the "wall of lecture" in some classes, meaning 1.5 or 2 hours of listening with little/no hands-on exercise.”
![Page 21: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/21.jpg)
Main Goal of SRE EDU Onboarding
● Instill confidence and convey SRE Culture
○ Teach just enough tech and tools to be able to navigate our troubleshooting exercises
○ Understand it is OK to ask questions or escalate
![Page 22: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/22.jpg)
Introduction to SRE EDU Orientation v2
● Move away from passive listening
● Instill confidence
● Troubleshoot a real system, built for this purpose
● Facilitator backs off more and more
● Groups of three students, least experienced in the middle, driving
![Page 23: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/23.jpg)
v2 Application Requirements
● Tangible
● ‘Real world’ applicable
● Distributed
● Applying best practices
● Application feels alive
● Breakable
○ "Sollbruchstelle" (predetermined breaking point)
![Page 24: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/24.jpg)
Typical reaction to the training experience...
![Page 25: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/25.jpg)
Design and Development Challenges
● Cannot "just" build it: follow best practices
● Use frameworks that guarantee best practices
● We need more than 1 instance
● Spoilers
● Development cycles…
![Page 26: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/26.jpg)
SRE EDU Orientation Was Built with Volunteers
Pros
● Knowledge about distributed systems is distributed.
● Flexible workforce.
Cons
● It takes longer.
● “Day job” can get in the way.
“WIIFM” (What's in it for me?)
● Flex skills
● Recognition
![Page 27: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/27.jpg)
The “Product”
![Page 28: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/28.jpg)
Architecture of the “Product”
Diagram components:
● LogicServer
● UiServer
● Spanner
● Cloud storage drop_zone
● Cloud storage long_term
● Uploader
● Google Generic Image service
● Cleanup Pipeline
● User
● Student
● Operator CLI
● ProdNet proxy
![Page 29: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/29.jpg)
How Does v2 Work in Practice
● As much automation as possible
● Names of new hires and interested people are added to the SRE EDU list
● Students are automatically assigned to classes and given the proper production permissions
● Instructor automation
● Breakage automation...
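As a rough illustration of the class-assignment step, the sketch below fills classes in order from a waitlist. It is not SRE EDU's tooling: `TrainingClass`, `assign_students`, and the capacity-based rule are all assumptions made for the example, and granting production permissions is left out because the slides do not describe how that is done.

```python
from dataclasses import dataclass, field


@dataclass
class TrainingClass:
    """Hypothetical record for one scheduled class."""
    name: str
    capacity: int
    roster: list[str] = field(default_factory=list)


def assign_students(waitlist: list[str], classes: list[TrainingClass]) -> list[str]:
    """Place each student in the first class with a free seat.

    Returns the students who could not be placed, so that a human
    (or a later run) can schedule more capacity for them.
    """
    unplaced = []
    for student in waitlist:
        target = next((c for c in classes if len(c.roster) < c.capacity), None)
        if target is None:
            unplaced.append(student)
        else:
            target.roster.append(student)
    return unplaced
```

Returning the unplaced students instead of silently dropping them keeps the capacity shortfall visible to whoever runs the program.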
![Page 30: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/30.jpg)
Automation of Breakages
● Breakages are enabled automatically based on class schedule (calendar)
● SRE EDU oncaller is paged if a breakage is *not* eating into the SLO's error budget fast enough
● Facilitator removes a silence when phones must page
● Students use the normal Google internal tools and have full rights
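The first two bullets can be sketched as a schedule lookup plus an inverted alert: the training team is paged when a scheduled breakage burns error budget too slowly, since that suggests the exercise is not actually broken. This is a minimal sketch under stated assumptions, not the SRE EDU implementation: `ScheduledBreakage`, `min_burn_rate`, and the function names are hypothetical, and burn rate is taken as the observed error ratio divided by the error ratio the SLO allows.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ScheduledBreakage:
    """Hypothetical calendar entry for one classroom breakage."""
    name: str
    start: datetime
    end: datetime
    min_burn_rate: float  # assumed threshold: slowest acceptable budget burn


def active_breakages(schedule: list[ScheduledBreakage], now: datetime) -> list[ScheduledBreakage]:
    """Return the breakages whose class window contains `now`."""
    return [b for b in schedule if b.start <= now < b.end]


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """Observed error ratio divided by the error ratio the SLO allows."""
    return error_ratio / (1.0 - slo_target)


def should_page_oncaller(breakage: ScheduledBreakage, error_ratio: float, slo_target: float) -> bool:
    """Page the training oncaller when the breakage burns budget too slowly."""
    return burn_rate(error_ratio, slo_target) < breakage.min_burn_rate
```

The comparison is deliberately the opposite of a production alert: during class, a healthy system is the failure mode.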
![Page 31: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/31.jpg)
What Does Our Monitoring Tell Us Now?
SRE EDU Orientation V2
● 97% Net Promoter Score (+7 pp vs v1)
● +26 pp increase in ‘Very Likely to Recommend’
● 87% of respondents report 1+ increase in confidence (+14 pp vs v1)
● Positive shift in histogram of Δ self-reported confidence
(Charts: Δ self-reported confidence; How likely to recommend?)
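For readers who want to compute comparable numbers from their own surveys, here is a minimal sketch. It assumes the conventional Net Promoter Score definition (0 to 10 scale, promoters score 9 or 10, detractors 0 to 6); the slides do not say which scale or formula SRE EDU used, and the function names are hypothetical.

```python
def net_promoter_score(ratings: list[int]) -> float:
    """Percent promoters (9-10) minus percent detractors (0-6)."""
    if not ratings:
        raise ValueError("no survey responses")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100.0 * (promoters - detractors) / len(ratings)


def share_with_confidence_gain(before: list[int], after: list[int], min_gain: int = 1) -> float:
    """Percent of respondents whose self-reported confidence rose by at least `min_gain`."""
    if not before or len(before) != len(after):
        raise ValueError("need matching, non-empty before/after responses")
    gained = sum(1 for b, a in zip(before, after) if a - b >= min_gain)
    return 100.0 * gained / len(before)
```

Pairing each student's before and after answers, rather than comparing class averages, is what makes a histogram of Δ confidence possible.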
![Page 32: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/32.jpg)
What Does Our Monitoring Tell Us Now?
● “I went in feeling quite apprehensive & came out feeling like I at least know which way I'm pointed. Thoroughly enjoyed the breakage activities and learning about how Google's infra, monitoring and processes fit together.”
● “Delving into real breaking scenarios was super valuable - I would love more of these (1 per day would be amazing).”
● “The breakage scenarios in SRE EDU were awesome.”
● “It was the funnest week I've had this year. Overall, it made me feel more connected to production and the technology, which made me really happy.”
![Page 33: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/33.jpg)
SRE EDU Orientation v2 is Better Instrumented for Observability
Concrete behaviors demonstrated
● Use a system diagram
● Diagnose issues using SRE tools
● Annotate an outage
● Mitigate a realistic production issue
● Find root cause & propose a solution
![Page 34: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/34.jpg)
SRE Training: Adapting for Small Companies
● Probably no classes, but self-directed, hands-on exercises
● Hands-on practice in an environment that looks like a production environment
● Have a script that breaks things
● Have a plausible story for each breakage
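A "script that breaks things" can be very small. The sketch below pairs each fault with the story a facilitator would tell, and injects it into a stand-in environment represented as a plain dictionary. Everything here is hypothetical (the `Scenario` type, the two faults, the keys in `env`); a real version would act on a sandbox that resembles production, never on production itself.

```python
import random
from dataclasses import dataclass
from typing import Callable


@dataclass
class Scenario:
    """One breakage: what to tell students and how to cause it."""
    name: str
    story: str
    inject: Callable[[dict], None]


def _fill_disk(env: dict) -> None:
    # Simulate a log volume with no free space left.
    env["disk_free_pct"] = 0


def _bad_config_push(env: dict) -> None:
    # Simulate a config change that points the frontend at a dead backend.
    env["backend_address"] = "unreachable.invalid"


SCENARIOS = [
    Scenario("full-disk", "A batch job logged far more than expected overnight.", _fill_disk),
    Scenario("bad-config", "A routine config push went out with a typo.", _bad_config_push),
]


def break_something(env: dict, rng: random.Random) -> Scenario:
    """Pick one scenario, apply it to the sandbox, and return it for the facilitator."""
    scenario = rng.choice(SCENARIOS)
    scenario.inject(env)
    return scenario
```

Passing in the random generator lets a facilitator seed it and replay the same breakage for the next cohort.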
![Page 35: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/35.jpg)
Instructional Design Principles for Orgs of All Sizes
● Know your audience
● Consider your culture
● Tell stories
● Define learning objectives
● Use a model for instructional design, e.g. ADDIE (Analysis, Design, Development, Implementation, Evaluation)
SRE Training Reliability Hierarchy (base to top): Attendance Tracking / Survey Feedback, Address Issues, Postmortem / RCA, Test Teaching, Scale Operations, Curriculum design, Program
![Page 36: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/36.jpg)
SRE Training Takeaways
● Training SREs is about building confidence and reducing imposter syndrome, not about a fire hose of information
● Hands-on exercises → more confidence
● The Service Reliability Hierarchy provides a useful framework for building and running an SRE training program.
![Page 40: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/40.jpg)
Final Words...
ASSBAT (“A Student Should Be Able To”)
![Page 41: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/41.jpg)
Thanks to the SRE EDU Core Team and All Our Volunteers!
● Brad Lipinski, SRE, Software Engineer
● Jennifer Petoff, Global Program Mgr & Lead
● David Butts, SRE, Software Engineer
● JC van Winkel, Lead Educator
● Preston Yoshioka, Instructional Designer
● Laura Baum, Program Manager
● Benjamin Weaver, Program Mgr
![Page 42: How to “SRE” an SRE Training Program Deploying SRE](https://reader031.vdocuments.site/reader031/viewer/2022020702/61fab469d6828136613aa0f4/html5/thumbnails/42.jpg)
SRE
Q & A
Jennifer Petoff (aka Dr. J)
Twitter: @jennski
Senior Program Manager and Global Lead, SRE EDU
JC van Winkel
Site Reliability Engineer and Lead Educator, SRE EDU