
Discussion findings
Lightning talks presenting our OSSD (Open Space-Style Discussion) Sessions

o11ycon 2018

o11ycon THE OBSERVABILITY CONFERENCE San Francisco | August 2, 2018 | https://twitter.com/o11ycon | https://o11ycon.io

Thanks to the amazing attendees who came to a conference and were told in the morning that not only did we have to carve out what our industry means when we talk about observability, but also throw together a slide deck about it by the end of the day. This is what we made together. ✨


o11y for non-engineers/non-experts

Summary What did we talk about? Why is this topic interesting?

The behavior of systems affects the business, and other people in the business are going to have questions about observing its behavior.

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

Non-experts can be split into 3 rough groupings:
- Business leadership
- Operations: e.g. product, customer support
- Technical peers

Call to action How we’ll apply our findings, change our behaviors, etc

Different groups need different approaches:
- Make it easier for business leadership to make decisions about the business. Direct attention to what’s important.
- Dialogue is key with operations. Internal tools need product, design, and UX love too.
- Make it easier for non-expert engineers to become expert engineers. Tools should be easy to get to: more summaries and handy links, fewer giant documents.

o11ycon slack discussion channel: #o11y_for_nonengineers
Link to notes: https://goo.gl/CbrFX4, https://goo.gl/e4Wi4Q


Testing in Prod

Summary What did we talk about? Why is this topic interesting?

- Not testing in prod can be costly: it is often too costly to set up a staging environment that mimics production
- When should we test in production? Can we live with this happening in production? What is the cost of failure?
- Exploratory learning vs. validating a hypothesis
- Engineers are responsible for what they ship
- Ways to test in production (feature flagging, running an experiment to compare results, canary releases, chaos engineering)
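Feature flagging, the first technique listed above, is often built on deterministic percentage rollouts. A minimal sketch, assuming a hypothetical `flag_enabled` helper and invented flag/user names; real flag libraries add targeting rules, kill switches, and persistence:

```python
import hashlib

def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into a percentage rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < rollout_percent

# A 10% canary: the same user always lands in the same bucket,
# so their experience stays consistent across requests.
if flag_enabled("new-checkout", "user-42", 10):
    pass  # serve the new code path
```

Because the bucket is derived from a hash rather than a coin flip, the rollout percentage can be raised gradually without flapping users in and out of the experiment.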

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

- Deploy, observe, detect, respond: be prepared for what happens after release. Know which metrics to watch, monitor any side effects, know how to roll back, and measure success.

- The difference between monitoring and exploration. Can we combine chaos engineering and observability?

Call to action How we’ll apply our findings, change our behaviors, etc

- Release-agnostic testing: treat the release as just a stage in the middle of testing

o11ycon slack discussion channel: #testing_in_prod
Link to notes: https://goo.gl/es6BwV


Testing in Production

(Or why mistakes are inevitable)


Making Observability Easier to Implement

Summary How do you observe a monolith?

- When does the observability happen? When’s an observable event? When do you log or sample a thing? Every request? Granularity?

- What’s in an event?
- How to determine an event ex post facto?
- Where do events go? Does it matter if you drop things?

- How do we add events to our app? Literally, how do we do this?
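One common answer to "how do we add events to our app" is to build a single wide, structured event per unit of work and emit it when the work finishes. A sketch under invented assumptions (the field names are hypothetical, and the "pipeline" here is just a JSON line on stdout):

```python
import json
import sys
import time

def handle_request(path: str, user_id: str) -> dict:
    """Build one wide event per request; emit it when the request ends."""
    event = {"path": path, "user_id": user_id}
    start = time.monotonic()
    try:
        # ... do the actual work, attaching any fact that might matter later
        event["cart_items"] = 3  # illustrative field
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        sys.stdout.write(json.dumps(event) + "\n")  # ship to your event pipeline
    return event

evt = handle_request("/checkout", "user-42")
```

The point of the `finally` block is that the event goes out even when the work raises, so failures are at least as observable as successes.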

Call to action

Observability is a process! (It is not an end state.)

o11ycon slack discussion channel: #5-implementing-o11y
Link to notes: https://goo.gl/x2JKgB


Tying Observability to Business Goals

Summary What did we talk about? Why is this topic interesting?

● How do we measure the effectiveness and cost / benefit of o11y work? How do we justify the cost? How do we communicate that throughout the org?

● Creating the chain that connects engineering to business goals changes the adversarial relationship of product vs. technical debt into a data-driven, business-enabled partnership

Interesting takeaways and calls to action Compelling conclusions! Surprising realizations! Relatable tips!

● Using the same source data to drive both engineering-oriented KPIs and business-decision OKRs is a good way to help drive o11y investments

● Having customer impact stories in your back pocket makes it easier to talk about why investing in o11y is important
● Maintain a regular-cadence “State of the Site” review meeting and make friends with support to understand when your KPI doesn’t reflect reality (9s don’t matter when users aren’t happy)
● Having a technical debt budget is an effective way to carve out space to implement o11y in the face of product demands
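The idea of driving both engineering KPIs and business metrics from the same source data can be sketched in miniature (the event fields and values below are invented for illustration):

```python
# Each event is one checkout request, as emitted by the service.
events = [
    {"duration_ms": 120, "status": 200, "order_value": 30.0},
    {"duration_ms": 980, "status": 200, "order_value": 55.0},
    {"duration_ms": 45,  "status": 500, "order_value": 0.0},
]

# Engineering KPI: request latency, computed from the raw events...
latencies = sorted(e["duration_ms"] for e in events)
p50 = latencies[len(latencies) // 2]

# ...and business metrics: checkout success rate and revenue,
# computed from the very same events.
ok = [e for e in events if e["status"] == 200]
success_rate = len(ok) / len(events)
revenue = sum(e["order_value"] for e in ok)
```

Because both views come from one event stream, an engineering regression (latency up) and its business consequence (success rate down) stay trivially correlatable.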

o11ycon slack discussion channel: #tying_o11y_2_bizgoals
Link to notes: https://goo.gl/y2m8Y2


Observability & Tracing

Summary What did we talk about? Why is this topic interesting?

● Traces vs. Events
● Aggregated traces & how ML can be applied to traces
● Getting out of the pattern of one tracing expert per company
● Tools, automatic vs. manual, sampling

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

● Tracing is the ability to solve complex problems: more than a single bit of data, it includes context
● One tracing expert per company happens because expertise is limited to a few individuals and the learning curve is very steep
● Increase sampling to a point where you see a number of interesting events

○ Determine “interesting events” by finding anomalies
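The sampling idea above can be sketched as dynamic sampling: keep every "interesting" event (errors, slow requests) and sample the routine bulk, recording the rate so counts can be re-weighted later. The thresholds and field names here are illustrative, not from any particular tracing tool:

```python
import random

def sample_rate(event: dict) -> int:
    """Keep every interesting event; sample the boring bulk."""
    if event.get("status", 200) >= 500 or event.get("duration_ms", 0) > 1000:
        return 1    # errors and slow requests: keep all of them
    return 100      # routine successes: keep roughly 1 in 100

def should_keep(event: dict) -> bool:
    rate = sample_rate(event)
    return random.randrange(rate) == 0

# A kept event should carry its sample rate, so that aggregates can be
# re-weighted later: one kept event stands for `rate` real ones.
```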

Call to action How we’ll apply our findings, change our behaviors, etc

● Decide what business metrics to care about. Choose the right tools. Implement.● Invest time, don’t rush into a solution.

o11ycon slack discussion channel: #o11y_and_tracing
Notes: https://goo.gl/qRKTDy, https://goo.gl/1uMtBM


Observability & On-Call

Summary What did we talk about? Why is this topic interesting?

● Be intentional rather than hoping for the best: make sure on call people feel supported and enabled, invest in tools (surface useful stuff) and teams (have backup/secondary responders)

● Make sure the person responding is the best person to respond; when that’s not the case, figure out why and fix it
● On-call is always evolving: from one person always being on call, to just one team, to multiple teams, to ...

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

● Trading on-call for “crunch time”: operations people don’t necessarily have to suffer while product people have to pull all-nighters to finish a feature/release.

● Chaos training: try to arrange to respond to simulated or real incidents during the day when everyone is around

Call to action How we’ll apply our findings, change our behaviors, etc

What would it take to get to only having to respond to work during working hours? Architect and implement things that move your stack in that direction.

o11ycon slack discussion channel: #o11y-and-oncall
Link to notes:


Creating a culture of O11y

“o11y is more than tooling, it's a culture”

Summary What did we talk about? Why is this topic interesting?

● What does O11y Culture look like? What values drive it?
● What challenges lie ahead, and what takeaway actions do we have?

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

● Many culture characteristics similar to Lean/DevOps/DevSecOps/High Perf… “Be a Learning Organization”
● Values: Trust, Empathy, Curiosity. Need to prove them out to leadership to create incentives that will drive behaviour
● If you don’t have a learning org / data-driven leadership… solve that first! (Or quit your job*)

Call to action How we’ll apply our findings, change our behaviors, etc

● Shift Observability Left: create tests for observability hygiene in CI (peer review, unit, integration, etc.)
● Include an “Observer” persona in design / user stories
● Get some quick wins and create Champions/Advocates at leadership and peer levels
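A test for observability hygiene in CI could be as simple as asserting that every event the app emits carries a required set of fields. A sketch, where the field names and the captured events are placeholders for whatever your instrumentation actually produces:

```python
REQUIRED_FIELDS = {"service", "trace_id", "duration_ms", "status"}

def check_event(event: dict) -> set:
    """Return the set of required fields the event is missing."""
    return REQUIRED_FIELDS - event.keys()

# In CI, capture the events your app emits during a test run and fail
# the build if any of them is missing a required field.
captured = [
    {"service": "checkout", "trace_id": "abc", "duration_ms": 12, "status": 200},
]
assert all(not check_event(e) for e in captured), "observability hygiene failed"
```

Wired into the existing test suite, this makes "you forgot to instrument it" a build failure rather than a 3 a.m. discovery.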

o11ycon slack discussion channel: #o11y_culture_change
Link to notes: https://docs.google.com/document/d/1RYT5N3LtF4myswFpICfQUql2adbtIhjwem9E-AhPU9c/edit?ts=5b6382ac


Observability-Driven Development

Summary What did we talk about? Why is this topic interesting?

How can we make decisions about what to work on by using Observability principles? How does this work in different kinds of systems (IT, cross-organizational, human)?

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

There is no 100% observable system (similar to how it’s impossible to get 100% test coverage)
The minimum to observe/track is entry & exit points within and between systems
It’s harder to go from 0 to 1 than to take subsequent steps
We want to be more proactive, but being reactive isn’t bad unless it’s 100% emergency time
Observations are highly context-dependent, based on data, what you’re trying to accomplish, and how people interpret “an observable event”

Call to action How we’ll apply our findings, change our behaviors, etc

Plan the things to observe based on getting from 0 to 1. Observations that cross systems are especially useful.

o11ycon slack discussion channel: #o11y-driven-dev
Link to notes:


Discovery

Summary Discovery’s place in the benevolent cycle of Suffering Driven Development

Interesting takeaways

Approaching discovery as inputs and outputs (and inputs)
Inputs: onboarding, proximity, data model, 80% documentation, push standards
Dashboards: useful vs. eye candy, reports, trends & prediction
Alerts: incidents / bad things (boo) & KPIs / good things (yay)
Feedback = decrease in suffering

Call to action

Automation & Factory Settings

Cross-Technology Visualization (cloud, apps, services, code, ci, developer)

Opportunities for Orchestration

o11ycon slack discussion channel: #general
Link to notes:


Observability and Monitoring

Summary What did we talk about? Why is this topic interesting?

● Definition of Monitoring: tracking, collecting data, watching. Monitoring is an action/verb

● Definition of O11y: explorable system state, similar to Visibility, Availability, Reliability. Observability is a quality

● What metrics do we use for observability?

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

● Does O11y require instrumentation? Is instrumentation just the tech behind observability?

● What role is left for old-fashioned metrics & aggregates? For CEO-type people: turn instrumentation into a spreadsheet of numbers, and the numbers into funding.

● Metrics are often indicators of the state of a thing. They are not outdated; they just have a different purpose

● From monitoring to event-driven o11y

Call to action How we’ll apply our findings, change our behaviors, etc

● Should there be an “observability index” of how observable a service is? What would 100% observability look like?
● Timeliness is as important as volume of observation data… and there may be too much data. How do we find the right balance (for me)?
● Could GDPR become an issue when one is instrumenting/logging one’s software?

o11ycon slack discussion channel: #o11y_monitoring
Link to notes: https://goo.gl/efuFec


Onboarding with Observability

Summary What did we talk about? Why is this topic interesting?

How do we teach the systems that we build? How do we approach on-call onboarding to teach observation systems and runbooks?

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

● Documentation and runbooks are “always out of date” or “dangerously wrong” if not used and updated regularly
● Somehow we teach other things just fine… how are they different?
● Software is made of “Dark Matter” that is hard to observe without investment

Call to action How we’ll apply our findings, change our behaviors, etc

1. Onboarding and learning is constant and ongoing; measure its success as part of engineering.
2. Use the “buddy system” in onboarding and on-call. Scale up responsibility… responsibly.
3. Documentation becomes stale when things change: create a process and cadence to change it too.
4. Knowledge of a system is how alive it is. Treat it like a CDN. Build competency in teaching the oral tradition.
5. Leave “breadcrumbs” everywhere. Link to tickets and GitHub and docs in everything. Increase contextual density.

o11ycon slack discussion channel: #onboarding_with_o11y
Link to notes: https://goo.gl/TKEzp9

“Humans are great at telling stories,what is our story?”


o11ycon slack discussion channel: #o11y_and_serverless
Link to notes: https://goo.gl/x69ogR


SEEING THE BIG PICTURE


LOSS OF CONTROL


On-call and #PagerLife

Summary What did we talk about? Why is this topic interesting?

Discussed the alert flow, incident team and communication channels.

Surfacing more of the details necessary to turn the alert into something more actionable: alert; theory; incident.

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

Alert fatigue is a real concern
Dashboard is a real concern

Being on call requires a mobile device and an incident-interrogation device
Include with alerts a context and a history of solutions (runbook)
Removing the friction of the operations to notify, organize, and gather makes a huge impact on the process

Call to action How we’ll apply our findings, change our behaviors, etc

A “mobile device” command-center view would remove more friction from the process.

o11ycon slack discussion channel: #general
Link to notes:


Thank you presenters!