
Discussion findings
Lightning talks presenting our OSSD (Open Space-Style Discussion) Sessions

o11ycon 2018

o11ycon THE OBSERVABILITY CONFERENCE San Francisco | August 2, 2018 | https://twitter.com/o11ycon | https://o11ycon.io

Thanks to the amazing attendees who came to a conference and were told in the morning that not only did we have to carve out what our industry means when we talk about observability, but also throw together a slide deck about it by the end of the day. This is what we made together. ✨


o11y for non-engineers/non-experts

Summary What did we talk about? Why is this topic interesting?

The behavior of systems affects the business, and other people in the business are going to have questions about observing its behavior.

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

Non-experts can be split into 3 rough groupings:
- Business leadership
- Operations: e.g. product, customer support
- Technical peers

Call to action How we’ll apply our findings, change our behaviors, etc

Different groups need different approaches:
- Make it easier for business leadership to make decisions about the business. Direct attention to what’s important.
- Dialogue is key with operations. Internal tools need product, design, and UX love too.
- Make it easier for non-expert engineers to become expert engineers. Tools should be easy to get to: more summaries and handy links, fewer giant documents.

o11ycon slack discussion channel: #o11y_for_nonengineers
Link to notes: https://goo.gl/CbrFX4, https://goo.gl/e4Wi4Q


Testing in Prod

Summary What did we talk about? Why is this topic interesting?

- Not testing in prod can be costly: it is often too costly to set up a staging environment that mimics production
- When should we test in production? Can we live with this happening in production? What is the cost of failure?
- Exploratory learning vs. validating a hypothesis
- Engineers are responsible for what they ship
- Ways to test in production (feature flagging, running an experiment to compare results, canary releases, chaos engineering)
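Feature flagging, the first technique listed above, is often built on deterministic percentage rollouts. A minimal sketch, assuming a hypothetical `flag_enabled` helper and invented flag/user names; real flag libraries add targeting rules, kill switches, and persistence:

```python
import hashlib

def flag_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministically bucket a user into a percentage rollout."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return bucket < rollout_percent

# A 10% canary: the same user always lands in the same bucket,
# so their experience stays consistent across requests.
if flag_enabled("new-checkout", "user-42", 10):
    pass  # serve the new code path
```

Because the bucket is derived from a hash rather than a coin flip, the rollout percentage can be raised gradually without flapping users in and out of the experiment.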

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

- Deploy, observe, detect, respond: be prepared for what happens after release. Know which metrics to watch, monitor any side effects, know how to roll back, and measure success.

- The difference between monitoring and exploration. Can we combine chaos engineering and observability?

Call to action How we’ll apply our findings, change our behaviors, etc

- Release-agnostic testing: treat the release as just a stage in the middle of testing

o11ycon slack discussion channel: #testing_in_prod
Link to notes: https://goo.gl/es6BwV


Testing in Production

(Or why mistakes are inevitable)


Making Observability Easier to Implement

Summary How do you observe a monolith?

- When does the observability happen? When’s an observable event? When do you log or sample a thing? Every request? Granularity?

- What’s in an event?
- How to determine an event ex post facto?
- Where do events go? Does it matter if you drop things?

- How do we add events to our app? Literally, how do we do this?
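One common answer to "how do we add events to our app" is to build a single wide, structured event per unit of work and emit it when the work finishes. A sketch under invented assumptions (the field names are hypothetical, and the "pipeline" here is just a JSON line on stdout):

```python
import json
import sys
import time

def handle_request(path: str, user_id: str) -> dict:
    """Build one wide event per request; emit it when the request ends."""
    event = {"path": path, "user_id": user_id}
    start = time.monotonic()
    try:
        # ... do the actual work, attaching any fact that might matter later
        event["cart_items"] = 3  # illustrative field
    finally:
        event["duration_ms"] = round((time.monotonic() - start) * 1000, 2)
        sys.stdout.write(json.dumps(event) + "\n")  # ship to your event pipeline
    return event

evt = handle_request("/checkout", "user-42")
```

The point of the `finally` block is that the event goes out even when the work raises, so failures are at least as observable as successes.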

Call to action

Observability is a process! (It is not an end state.)

o11ycon slack discussion channel: #5-implementing-o11y
Link to notes: https://goo.gl/x2JKgB


Tying Observability to Business Goals

Summary What did we talk about? Why is this topic interesting?

● How do we measure the effectiveness and cost / benefit of o11y work? How do we justify the cost? How do we communicate that throughout the org?

● Creating the chain that connects engineering to business goals changes the adversarial relationship of product vs. technical debt into a data-driven, business-enabled partnership

Interesting takeaways and calls to action Compelling conclusions! Surprising realizations! Relatable tips!

● Using the same source data to drive both engineering-oriented KPIs and business-decision OKRs is a good way to help drive o11y investments

● Having customer impact stories in your back pocket makes it easier to talk about why investing in o11y is important
● Maintain a regular-cadence “State of the Site” review meeting and make friends with support to understand when your KPI doesn’t reflect reality (9s don’t matter when users aren’t happy)
● Having a technical debt budget is an effective way to carve out space to implement o11y in the face of product demands
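The idea of driving both engineering KPIs and business metrics from the same source data can be sketched in miniature (the event fields and values below are invented for illustration):

```python
# Each event is one checkout request, as emitted by the service.
events = [
    {"duration_ms": 120, "status": 200, "order_value": 30.0},
    {"duration_ms": 980, "status": 200, "order_value": 55.0},
    {"duration_ms": 45,  "status": 500, "order_value": 0.0},
]

# Engineering KPI: request latency, computed from the raw events...
latencies = sorted(e["duration_ms"] for e in events)
p50 = latencies[len(latencies) // 2]

# ...and business metrics: checkout success rate and revenue,
# computed from the very same events.
ok = [e for e in events if e["status"] == 200]
success_rate = len(ok) / len(events)
revenue = sum(e["order_value"] for e in ok)
```

Because both views come from one event stream, an engineering regression (latency up) and its business consequence (success rate down) stay trivially correlatable.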

o11ycon slack discussion channel: #tying_o11y_2_bizgoals
Link to notes: https://goo.gl/y2m8Y2


Observability & Tracing

Summary What did we talk about? Why is this topic interesting?

● Traces vs. Events
● Aggregated traces & how ML can be applied to traces
● Getting out of the pattern of one tracing expert per company
● Tools, automatic vs. manual, sampling

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

● Tracing is the ability to solve complex problems: more than a single bit of data, it includes context
● One tracing expert per company happens because expertise is limited to a few individuals and the learning curve is very steep
● Increase sampling to a point where you see a number of interesting events

○ Determine “interesting events” by finding anomalies
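The sampling idea above can be sketched as dynamic sampling: keep every "interesting" event (errors, slow requests) and sample the routine bulk, recording the rate so counts can be re-weighted later. The thresholds and field names here are illustrative, not from any particular tracing tool:

```python
import random

def sample_rate(event: dict) -> int:
    """Keep every interesting event; sample the boring bulk."""
    if event.get("status", 200) >= 500 or event.get("duration_ms", 0) > 1000:
        return 1    # errors and slow requests: keep all of them
    return 100      # routine successes: keep roughly 1 in 100

def should_keep(event: dict) -> bool:
    rate = sample_rate(event)
    return random.randrange(rate) == 0

# A kept event should carry its sample rate, so that aggregates can be
# re-weighted later: one kept event stands for `rate` real ones.
```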

Call to action How we’ll apply our findings, change our behaviors, etc

● Decide what business metrics to care about. Choose the right tools. Implement.● Invest time, don’t rush into a solution.

o11ycon slack discussion channel: #o11y_and_tracing
Notes: https://goo.gl/qRKTDy, https://goo.gl/1uMtBM


Observability & On-Call

Summary What did we talk about? Why is this topic interesting?

● Be intentional rather than hoping for the best: make sure on call people feel supported and enabled, invest in tools (surface useful stuff) and teams (have backup/secondary responders)

● Make sure the person responding is the best person to respond; when that’s not the case, figure out why and fix it
● On-call is always evolving: from one person always being on call, to just one team, to multiple teams, to ...

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

● Trading on-call for “crunch time”: operations people don’t necessarily have to suffer while product people have to pull all-nighters to finish a feature/release.

● Chaos training: try to arrange to respond to simulated or real incidents during the day when everyone is around

Call to action How we’ll apply our findings, change our behaviors, etc

What would it take to get to only having to respond to work during working hours? Architect and implement things that move your stack in that direction.

o11ycon slack discussion channel: #o11y-and-oncall
Link to notes:


Creating a culture of O11y

“o11y is more than tooling, it's a culture”

Summary What did we talk about? Why is this topic interesting?

● What does O11y Culture look like? What values drive it?
● What challenges lie ahead, and what takeaway actions do we have?

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

● Many culture characteristics similar to Lean/DevOps/DevSecOps/High Perf… “Be a Learning Organization”
● Values: Trust, Empathy, Curiosity. Need to prove them out to leadership to create incentives that will drive behaviour
● If you don’t have a learning org / data-driven leadership… solve that first! (Or quit your job*)

Call to action How we’ll apply our findings, change our behaviors, etc

● Shift Observability Left: create tests for observability hygiene in CI (peer review, unit, integration, etc.)
● Include an “Observer” persona in design / user stories
● Get some quick wins and create Champions/Advocates at leadership and peer levels
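A test for observability hygiene in CI could be as simple as asserting that every event the app emits carries a required set of fields. A sketch, where the field names and the captured events are placeholders for whatever your instrumentation actually produces:

```python
REQUIRED_FIELDS = {"service", "trace_id", "duration_ms", "status"}

def check_event(event: dict) -> set:
    """Return the set of required fields the event is missing."""
    return REQUIRED_FIELDS - event.keys()

# In CI, capture the events your app emits during a test run and fail
# the build if any of them is missing a required field.
captured = [
    {"service": "checkout", "trace_id": "abc", "duration_ms": 12, "status": 200},
]
assert all(not check_event(e) for e in captured), "observability hygiene failed"
```

Wired into the existing test suite, this makes "you forgot to instrument it" a build failure rather than a 3 a.m. discovery.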

o11ycon slack discussion channel: #o11y_culture_change
Link to notes: https://docs.google.com/document/d/1RYT5N3LtF4myswFpICfQUql2adbtIhjwem9E-AhPU9c/edit?ts=5b6382ac


Observability-Driven Development

Summary What did we talk about? Why is this topic interesting?

How can we make decisions about what to work on by using Observability principles? How does this work in different kinds of systems (IT, cross-organizational, human)?

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

There is no 100% observable system (similar to how it’s impossible to get 100% test coverage)
The minimum to observe/track is entry & exit points within and between systems
It’s harder to go from 0 to 1 than to take subsequent steps
We want to be more proactive, but being reactive isn’t bad unless it’s 100% emergency time
Observations are highly context-dependent, based on data, what you’re trying to accomplish, and how people interpret “an observable event”

Call to action How we’ll apply our findings, change our behaviors, etc

Plan the things to observe based on getting from 0 to 1. Observations that cross systems are especially useful.

o11ycon slack discussion channel: #o11y-driven-dev
Link to notes:


Discovery

Summary Discovery’s place in the benevolent cycle of Suffering Driven Development

Interesting takeaways

Approaching discovery as inputs and outputs (and inputs)
Inputs: onboarding, proximity, data model, 80% documentation, push standards
Dashboards: useful vs. eye candy, reports, trends & prediction
Alerts: incidents / bad things (boo) & KPIs / good things (yay)
Feedback = decrease in suffering

Call to action

Automation & Factory Settings

Cross-Technology Visualization (cloud, apps, services, code, ci, developer)

Opportunities for Orchestration

o11ycon slack discussion channel: #general
Link to notes:


Observability and Monitoring

Summary What did we talk about? Why is this topic interesting?

● Definition of Monitoring: tracking, collecting data, watching. Monitoring is an action/verb

● Definition of O11y: explorable system state, similar to Visibility, Availability, Reliability. Observability is a quality

● What metrics do we use for observability?

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

● Does O11y require instrumentation? Is instrumentation just the tech behind observability?

● What role is left for old-fashioned metrics & aggregates? For CEO-type people: turn instrumentation into a spreadsheet of numbers, and the numbers into funding.

● Metrics are often indicators of the state of a thing. They are not outdated; they just have a different purpose

● From monitoring to event-driven o11y

Call to action How we’ll apply our findings, change our behaviors, etc

● Should there be an “observability index” of how observable a service is? What would 100% observability look like?
● Timeliness is as important as volume of observation data… and there may be too much data. How do we find the right balance (for me)?
● Could GDPR become an issue when one is instrumenting/logging one’s software?

o11ycon slack discussion channel: #o11y_monitoring
Link to notes: https://goo.gl/efuFec


Onboarding with Observability

Summary What did we talk about? Why is this topic interesting?

How do we teach the systems that we build? How do we approach on-call onboarding to teach observation systems and runbooks?

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

● Documentation and runbooks are “always out of date” or “dangerously wrong” if not used and updated regularly
● Somehow we teach other things just fine… how are they different?
● Software is made of “Dark Matter” that is hard to observe without investment

Call to action How we’ll apply our findings, change our behaviors, etc

1. Onboarding and learning is constant and ongoing; measure its success as part of engineering.
2. Use the “buddy system” in onboarding and on-call. Scale up responsibility… responsibly.
3. Documentation becomes stale when things change: create a process and cadence to change it too.
4. Knowledge of a system is how alive it is. Treat it like a CDN. Build competency in teaching the oral tradition.
5. Leave “breadcrumbs” everywhere. Link to tickets and GitHub and docs in everything. Increase contextual density.

o11ycon slack discussion channel: #onboarding_with_o11y
Link to notes: https://goo.gl/TKEzp9

“Humans are great at telling stories,what is our story?”


o11ycon slack discussion channel: #o11y_and_serverless
Link to notes: https://goo.gl/x69ogR


SEEING THE BIG PICTURE


LOSS OF CONTROL


On-call and #PagerLife

Summary What did we talk about? Why is this topic interesting?

Discussed the alert flow, incident team and communication channels.

Surfacing more of the details necessary to turn the alert into something more actionable: alert; theory; incident.

Interesting takeaways Compelling conclusions! Surprising realizations! Relatable tips!

Alert fatigue is a real concern
Dashboard is a real concern

Being on call requires a mobile device and an incident-interrogation device
Include with alerts a context and a history of solutions (runbook)
Removing the friction of the operations to notify, organize, and gather makes a huge impact on the process

Call to action How we’ll apply our findings, change our behaviors, etc

A “mobile device” command-center view would remove more friction from the process.

o11ycon slack discussion channel: #general
Link to notes:


Thank you presenters!