site reliability engineering · site reliability engineering devops on steroids big techday 12,...
TRANSCRIPT
Site Reliability Engineering
DevOps on Steroids
Big Techday 12, 2019-06-07
Maximilian Bode
https://techcrunch.com/2016/03/02/are-site-reliability-engineers-the-next-data-scientists/
https://insights.stackoverflow.com/survey/2019
Site Reliability Engineering
1. Fundamentals
2. Principles
3. Practice
Fundamentals
Source: http://turnoff.us/geek/devops-explained/
Source: https://landing.google.com/sre/books/
Site Reliability Engineering
Ben Treynor, VP of Engineering, Google
Source: https://www.linkedin.com/in/benjamin-treynor-sloss-207120/
Fundamentally, it’s what happens when you ask a software engineer to design an operations function.
DevOps and SRE as Competitors?
class SRE implements DevOps{ }
Who Practices SRE?
Apple
Evernote
Atlassian
The Home Depot
The New York Times
and many more…
Principles
Dealing with Risk
Hope is not a strategy.
Metrics as the Basis for Decisions
Service Level Objectives
SLI
Indicator
SLO
Objective (target)
SLA
Agreement
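The relationship between indicator and objective can be illustrated with a minimal Python sketch. It assumes an availability SLI defined as the fraction of successful requests; the function names, request counts, and the 99.9% target are illustrative assumptions, not values from the talk.

```python
def availability_sli(good_requests: int, total_requests: int) -> float:
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0  # assumption: a window without traffic counts as fully available
    return good_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: the measured indicator must reach the target."""
    return sli >= slo_target


# Hypothetical numbers for one measurement window.
sli = availability_sli(good_requests=99_950, total_requests=100_000)
print(f"SLI = {sli:.4%}, SLO met: {meets_slo(sli, slo_target=0.999)}")
```

An SLA would then attach contractual consequences to missing such a target; that part is not modeled here.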
Error Budgets
Balance between development velocity and reliability
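As a rough worked example of the arithmetic behind an error budget: an availability SLO of 99.9% over a 30-day window leaves roughly 43.2 minutes of tolerated downtime. The sketch below assumes a time-based budget and a simple "release only while budget remains" policy; the window length, function names, and policy are illustrative assumptions.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Tolerated downtime in minutes implied by an availability SLO."""
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after subtracting the downtime already incurred."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes


def can_release(slo_target: float, downtime_minutes: float,
                window_days: int = 30) -> bool:
    """Assumed policy: risky changes go out only while budget is left."""
    return budget_remaining(slo_target, downtime_minutes, window_days) > 0


print(f"budget: {error_budget_minutes(0.999):.1f} min")
print("release allowed:", can_release(0.999, downtime_minutes=10.0))
```

While budget remains, the team can favor velocity; once it is spent, reliability work takes priority.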
Automation
Eliminating toil
Practice
Project experience
Cloud
Infrastructure as Code
Source: https://www.terraform.io/logos.html
resource "aws_lambda_function" "serverless_test" {
  filename         = "my_code.zip"
  function_name    = "lambda_function_name"
  role             = "${aws_iam_role.iam_for_lambda.arn}"
  handler          = "serverless.handler"
  source_code_hash = "${filebase64sha256("my_code.zip")}"
  runtime          = "python3.7"
}
Containers
Source: https://blog.docker.com/2013/06/announcing-new-docker-style/
Source: https://github.com/cncf/artwork
GitOps
CI/CD
Source: https://about.gitlab.com/press/press-kit/
Monitoring
Three Pillars of Observability
Structured Logging
Metrics
Traces
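For the first pillar, a minimal sketch of structured logging with Python's standard `logging` and `json` modules might look as follows. The `JsonFormatter` class, the `fields` attribute, and the example event are hypothetical names chosen for illustration, not taken from the talk.

```python
import json
import logging
import sys


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Assumption: callers pass extra key/value pairs via extra={"fields": {...}}.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("demo")
logger.addHandler(handler)
logger.setLevel(logging.INFO)

logger.info("job restarted", extra={"fields": {"job": "example", "attempt": 2}})
```

Machine-readable fields like these can be filtered and aggregated downstream, which free-text log lines make much harder.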
Metrics
Four Golden Signals
Latency
Traffic
Error rate
Saturation
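The four signals above could be derived from raw request records roughly as sketched below. The `Request` type, the nearest-rank 99th percentile, and the definition of saturation as traffic divided by an assumed capacity are simplifying assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    duration_ms: float  # time taken to serve the request
    ok: bool            # False if the request failed


def golden_signals(requests: List[Request], window_seconds: float,
                   capacity_rps: float) -> Dict[str, float]:
    """Compute latency, traffic, error rate and saturation for one window."""
    n = len(requests)
    if n == 0:
        return {"latency_p99_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": 0.0}
    durations = sorted(r.duration_ms for r in requests)
    rank = (99 * n + 99) // 100  # nearest-rank p99, i.e. ceil(0.99 * n) in integer math
    traffic = n / window_seconds
    return {
        "latency_p99_ms": durations[rank - 1],
        "traffic_rps": traffic,
        "error_rate": sum(1 for r in requests if not r.ok) / n,
        # Assumption: saturation approximated as load relative to known capacity.
        "saturation": traffic / capacity_rps,
    }


sample = [Request(10.0, True)] * 98 + [Request(100.0, True), Request(500.0, False)]
print(golden_signals(sample, window_seconds=10.0, capacity_rps=40.0))
```

In practice these values would usually come from a metrics system rather than from in-memory lists.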
Metrics
Prometheus
Source: https://en.wikipedia.org/wiki/File:Prometheus_software_logo.svg
Dashboards
Alerts
- alert: FlinkJobsMissing
  expr: sum(flink_api_jobs_running) < 2
  for: 3m
  annotations:
    summary: Fewer Flink jobs than expected are running.
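The `for: 3m` clause means the condition must hold continuously for three minutes before the alert fires; until then it is only pending. The Python sketch below imitates that behavior on a list of samples. It is a simplified model for illustration, not Prometheus's actual implementation, and the function name and sample data are hypothetical.

```python
from typing import List, Tuple


def alert_states(samples: List[Tuple[int, float]], threshold: float = 2,
                 for_seconds: int = 180) -> List[str]:
    """Return the alert state per sample.

    samples: (timestamp_seconds, running_jobs) pairs, sorted by time.
    """
    states = []
    pending_since = None  # time at which the condition first became true
    for ts, value in samples:
        if value < threshold:
            if pending_since is None:
                pending_since = ts
            held_for = ts - pending_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            pending_since = None  # condition cleared, reset the timer
            states.append("inactive")
    return states


# One sample per minute: a job goes missing at t=60 and returns at t=300.
samples = [(0, 2), (60, 1), (120, 1), (180, 1), (240, 1), (300, 2)]
print(alert_states(samples))
```

The waiting period keeps short blips, such as a job restarting, from paging anyone.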
Team Structure
Ad hoc tasks vs. long-term improvements
Operations Manager
Incident Management
Incident Postmortem
Written record after an incident
Impact
Actions taken
Root Cause
Blameless communication (internal & external)
What Else?
Microservice architecture
Chaos Engineering
Psychological Safety
Handley Page W.8, 1919
16 passengers
2 pilots
Airbus A380, 2005
853 passengers
2 pilots
Maximilian Bode
@mxpbode
mbode