IBM Center for Cloud Training Webinar IBM Cloud Site Reliability Engineer (SRE) Certification September 28th, 2021

Post on 08-Nov-2021


Page 1: IBM Center for Cloud Training

IBM Center for Cloud Training Webinar

IBM Cloud Site Reliability Engineer (SRE) Certification

September 28th, 2021

Page 2: IBM Center for Cloud Training

Agenda

• Introduction
• Discussions around the following topics:
  • How has the role of SRE evolved?
  • How have the skills required for SRE changed?
  • What are the key topics for the IBM Cloud SRE Certification?
  • How can I prepare for the Certification?

You can use the chat window to ask a question.

Page 3: IBM Center for Cloud Training

Expert Perspective

• Kevin Yu, Principal SRE - Site Reliability, Incident & Problem Management, and SRE KPI
• John Easton, Distinguished Engineer - Cloud Strategy & Business Development Engineering
• Cale Cohee, SRE - AI Applications Platform

Page 4: IBM Center for Cloud Training

Anti-Pattern

Saying buzzwords like “SRE” won’t achieve outcomes by itself

SRE is not a one-time event

Page 5: IBM Center for Cloud Training

Evolving role of SRE

Development Team (“Developers”) and Operation Team (“SysAdmins”): conflicts

DevOps: I built it, I run it

SRE: What happens when you ask a software engineer to design an operations function

Typical view of “SRE”: SRE budget & SLO; velocity vs. reliability
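
The trade-off between velocity and reliability can be illustrated with a small calculation. The sketch below assumes the budget on this slide is an error budget derived from a request-based availability SLO. The class name, field names, and the release policy are hypothetical and are not taken from the webinar.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Error budget for a request-based availability SLO (illustrative only)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int    # requests observed in the SLO window
    failed_requests: int   # requests that violated the SLI in that window

    def __post_init__(self) -> None:
        # A 100% target or an empty window leaves no budget to reason about.
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if self.total_requests <= 0:
            raise ValueError("total_requests must be positive")

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 = budget untouched, 0.0 = fully spent, negative = SLO breached.
        return 1.0 - self.failed_requests / self.allowed_failures


def release_decision(budget: ErrorBudget, freeze_below: float = 0.0) -> str:
    """Hypothetical policy: ship while budget remains, otherwise fix reliability."""
    if budget.remaining_fraction > freeze_below:
        return "ship features"
    return "focus on reliability"


if __name__ == "__main__":
    window = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(round(window.remaining_fraction, 3), release_decision(window))
```

With these example numbers the SLO permits about 1,000 failed requests, so 400 failures leave roughly 60% of the budget and the policy favors velocity. Once failures exceed the allowance, the same policy shifts attention to reliability.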

Page 6: IBM Center for Cloud Training

Evolving role of SRE

Looking from the lens of the solution life cycle

Life cycle stages: DCUT, Build/Regression, Test, architect, deploy, operate, monitor, release

SRE: typical role SRE plays in the life cycle

Page 7: IBM Center for Cloud Training

Design-Pattern

Support Site Reliability Engineers with prioritization and resources

SRE discipline is applied in the entire life cycle

Page 8: IBM Center for Cloud Training

SRE Roles: Data Driven, KPI Focused feature delivery

Life cycle stages: DCUT, Build/Regression, Test, architect, deploy, operate, monitor, release

Start here and iterate:

• Understand user commitments / SLA; Empathy Mapping
• Formulate measurable SLIs and SLOs to meet commitments
• Instrument code to measure SLI and KPI
• Early PoC; validate implementation against SLOs
• BVT: measure SLI / KPI of new build vs. previous; success means staying within established SLOs
• Scalability test to understand capacity and elasticity towards meeting SLO; approximate # of nodes and hardware required for projection and cost
• Duration test to understand reliability over time: consistent SLI and no upward-trending KPIs and resources
• Resiliency test to understand break point, visibility to disruptions, and recovery
• Validate new build against SLO
• Dark launch and canary release to test and mitigate risks
• Validate service against SLO
• Mitigate disruptions based on SLI visibility
• Focus on quick recovery (MTTR) from failures
• Identify additional use cases that negatively impact SLI and exceed SLO, and improve pipeline
• Postmortem learning to surface and prioritize SRE tasks
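
One of the practices on this slide, the BVT comparison of a new build against the previous one, can be sketched as a simple gate. This is a minimal illustration under stated assumptions: the metric fields, the thresholds, and the 10% regression tolerance are hypothetical and do not come from the webinar or from any IBM tooling.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BuildMetrics:
    """SLI inputs collected from one build's verification run (hypothetical fields)."""
    good_requests: int       # requests that met the success criterion
    total_requests: int      # all requests issued during the run
    p95_latency_ms: float    # example KPI tracked alongside the SLI

    @property
    def availability_sli(self) -> float:
        # SLI expressed as the ratio of good events to all events.
        return self.good_requests / self.total_requests


def bvt_gate(new: BuildMetrics, previous: BuildMetrics,
             slo_availability: float, slo_p95_ms: float,
             max_latency_regression: float = 0.10) -> List[str]:
    """Return a list of failure reasons; an empty list means the build passes."""
    failures: List[str] = []
    if new.availability_sli < slo_availability:
        failures.append("availability below SLO")
    if new.p95_latency_ms > slo_p95_ms:
        failures.append("p95 latency above SLO")
    # Compare with the previous build to catch regressions that still sit inside the SLO.
    if new.p95_latency_ms > previous.p95_latency_ms * (1 + max_latency_regression):
        failures.append("p95 latency regressed versus previous build")
    return failures


if __name__ == "__main__":
    previous = BuildMetrics(good_requests=9995, total_requests=10000, p95_latency_ms=180.0)
    candidate = BuildMetrics(good_requests=9980, total_requests=10000, p95_latency_ms=240.0)
    print(bvt_gate(candidate, previous, slo_availability=0.999, slo_p95_ms=250.0))
```

Returning reasons rather than a single boolean keeps the result usable as postmortem input: a build can sit inside the latency SLO and still be flagged because it trends worse than its predecessor.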

Page 9: IBM Center for Cloud Training

The Tenets of SRE

Capture approaches to modernizing "the way we work" as we implement and provide services to our clients in a hybrid multi-cloud world.

Result in tangible and identifiable outcomes.

In Site Reliability Engineering these practices are often referred to as "The Tenets of SRE."

Page 10: IBM Center for Cloud Training

What should you know about IBM Cloud SRE certification?

• What is the process?
• What are the variations?
• What content is tested?

Page 11: IBM Center for Cloud Training

Certification Process

Job Task Analysis

Blueprint Survey

Question Writing

Technical Review

Angoff Scoring

Publish Exam

Page 12: IBM Center for Cloud Training

Associate vs. Professional SRE comparison

Page 13: IBM Center for Cloud Training

Sources of IBM Cloud SRE education materials

Page 19: IBM Center for Cloud Training

Experiences of a newly certified IBM Cloud SRE

• How did you study for it?
• What are the types of questions?

Page 20: IBM Center for Cloud Training

Study Guide

Page 21: IBM Center for Cloud Training

Sample Exam

Page 22: IBM Center for Cloud Training

Assessment Exam

Page 23: IBM Center for Cloud Training

Certification Questions

• Example 1

A service owner is stating that their service needs to have an SLO of 99.99%. Why might the SRE team suggest a lower, more appropriate target?

A. Meeting the 99.99% target is too hard to achieve
B. Users can't tell the difference between very high levels of availability anyway
C. The target is something that should be exceeded, but not by too much, to support the error budget
D. This is a new service so the availability target shouldn't be that high
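
To put the 99.99% figure in this question into concrete terms, the following sketch converts an availability target into the downtime it permits. It assumes a 30-day window and treats all unavailability as full outages. The function name and the window length are illustrative choices, not part of the exam material.

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full unavailability an availability target permits per window."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100.0 - slo_percent) / 100.0


if __name__ == "__main__":
    # Each extra "nine" shrinks the allowance by a factor of ten.
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.2f} min per 30 days")
```

Under these assumptions a 99.99% target leaves about 4.3 minutes of downtime per 30 days, compared with about 43 minutes at 99.9%, which is why the choice of target has a direct effect on the error budget available to the team.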


Page 25: IBM Center for Cloud Training

Certification Questions

• Example 2

What is a responsibility of the First Responder role in a Cloud Service Operation Management structure?

A. Responsible for identifying the root cause of the incident
B. Responsible for receiving incident information and collaborating with SMEs to restore services as fast as possible
C. Responsible for overseeing the handling of a problem and bringing it to closure
D. Responsible for executing runbooks and working with subject matter experts (SMEs) to restore the service



Page 28: IBM Center for Cloud Training

Resources you can use

• Join Expert TV for the latest news and hot topics in cloud training: https://ibm.biz/ICCTonExpertTV
• Join the discussion forum and connect with IBM Cloud Experts & Curriculum Managers: http://ibm.biz/cloudtrainingdiscussionforum
• Look for new IBM Cloud role-based certifications on the ICCT web page: www.ibm.com/training/cloud/jobroles
• Event devoted to IT resilience, performance, security, quality testing, and SRE: https://ibm.biz/BdfgnD

Page 29: IBM Center for Cloud Training

Q&A
