@mipsytipsy… · business intelligence, aka why nothing we are doing is remotely new why tools...

Charity Majors @mipsytipsy Observability & Complex Systems What got you here won't get you there, and other terrifying true tales from the computing frontier


Post on 29-Sep-2020


TRANSCRIPT

Page 1:

Charity Majors @mipsytipsy

Observability & Complex Systems

What got you here won't get you there, and other terrifying true tales from the computing frontier

Page 2:

Charity Majors @mipsytipsy

Page 3:

@mipsytipsy

engineer/cofounder/CEO

https://charity.wtf

“the only good diff is a red diff”

Page 4:

A short list of topics to cover in this time:

software ownership over the full lifecycle of the code
on call
what it means to be a senior engineer
why operations engineers are now product owners and player-coaches
and maybe there's an ops pendulum in there somewhere too
this has neat ramifications for "career progression for non managers"
observability, control theory
the so-called three pillars
why SWEs could never be on call with monitoring-era tooling
why i started honeycomb
why christine started honeycomb
the stunning range of maturity and ability of the teams we talk to ...
and it has almost *zero* to do with their mean engineering ability. eye bulge.
"chaos engineering"
you must be this tall to ride this ride. (are you?)
business intelligence, aka why nothing we are doing is remotely new
why tools create silos
the implications of democratizing access to data
particularly for levels and career progressions
how deploys must change
the misallocation of internal tooling energy away from deploy software
why you need to test in prod
why you need a canary (probably)
when to know you need a canary
why you definitely need feature flags, no matter what
test doesn't mean what you think it means
the future of development is observability-driven development. "O-D-D yeah YOU KNOW ME"
why we have to stop leaning on intuition and tribal knowledge before it is too late

Page 5:

continued ...

"chaos engineering"
you must be this tall to ride this ride. (are you? how do you evaluate this?)
business intelligence, aka why nothing we are doing is remotely new
why tools create silos
the implications of democratizing access to data
particularly for levels and career progressions
how deploys must change
the misallocation of internal tooling energy away from deploy software
why you need to test in prod
why you need a canary (probably)
when to know you need a canary
why you definitely need feature flags, no matter what
test doesn't mean what you think it means
the future of development is observability-driven development. "O-D-D yeah YOU KNOW ME"
why we have to stop leaning on intuition and tribal knowledge before it is too late

Page 6:

cont'd ... just a brief outline

the future of development is observability-driven development. "O-D-D yeah YOU KNOW ME"
why we have to stop leaning on intuition and tribal knowledge before it is too late
why AIOps is stupid and doomed
why the team is your best source of wisdom
why wisdom is not truth
why ops needs to learn about design principles stat
why vendors are rushing to coopt the observability message before you notice they don't actually fulfill the demands, and why this makes me Very Stabby

Page 7:

A short partial list of things I would like to touch on...

"chaos engineering"
you must be this tall to ride this ride. (are you? how do you evaluate this?)
business intelligence, aka why nothing we are doing is remotely new
why tools create silos
the implications of democratizing access to data
particularly for levels and career progressions
how deploys must change
the misallocation of internal tooling energy away from deploy software
why you need to test in prod
why you need a canary (probably)
when to know you need a canary
why you definitely need feature flags, no matter what
test doesn't mean what you think it means
the future of development is observability-driven development. "O-D-D yeah YOU KNOW ME"
why we have to stop leaning on intuition and tribal knowledge before it is too late

Page 8:

A few of the things I want to talk about:

software ownership
on call
what it means to be a senior engineer
"chaos engineering"
observability, control theory
the so-called three pillars
why SWEs could never be on call with monitoring-era tooling
why i started honeycomb
why christine started honeycomb
business intelligence, aka why nothing we are doing is remotely new
why tools create silos
the implications of democratizing access to data
particularly for levels and career progressions
how deploys must change
the misallocation of internal tooling energy away from deploy software
why you need to test in prod
why you need a canary (probably)
when to know you need a canary
why you definitely need feature flags, no matter what
test doesn't mean what you think it means
the future of development is observability-driven development. "O-D-D yeah YOU KNOW ME"
why we have to stop leaning on intuition and tribal knowledge before it is too late

Page 9:

"How did we get here?"

Page 10:

Monitoring (time series databases, dashboards, all based on the 'metric')

Logs (messy ass strings, really)

More recently, APM and tracing have gotten more widespread.

Page 11:

"How did we get here?"

Page 12:

"What do we need to get where we're going?"

Page 13:

Microservices require a shift in how we think about software.

Page 14:

Our idea of what the software development lifecycle even looks like is overdue for an upgrade in the era of distributed systems.

Page 15:

TDD stops at your laptop’s edge

Page 16:

Deploying code is not a binary switch.

Deploying code is a process of increasing your confidence in your code.
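
One way to picture "increasing your confidence" is a percentage rollout behind a feature flag. The sketch below is only an illustration, not something from the slides: `in_rollout`, the flag name `new-photo-path`, and the stage percentages are all hypothetical. It hashes the flag and user ID into a stable bucket so exposure can be widened step by step.

```python
import hashlib

def in_rollout(flag: str, user_id: str, percent: float) -> bool:
    """Return True if this user falls inside the first `percent` of buckets.

    Hashing flag + user id gives each user a stable bucket in [0, 10000),
    so a user who saw the new code at 5% still sees it at 25%.
    """
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 10_000
    return bucket < percent * 100

# Hypothetical stages: widen exposure only while production still looks healthy.
STAGES = [1, 5, 25, 100]

users = [f"user-{n}" for n in range(1000)]
exposed = {
    pct: {u for u in users if in_rollout("new-photo-path", u, pct)}
    for pct in STAGES
}
```

Each stage's audience is a superset of the previous one, so every step adds new users without flipping anyone back to the old path.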

Page 17:

Development -> (deploy) -> Production

Page 18:

Observability

Development Production

Page 19:

Observability

Development Production

Page 20:

why now?

Page 21:

“Complexity is increasing” - Science

Page 22:

Architectural complexity

Parse, 2015

LAMP stack, 2005

Page 23:

Parse, 2015

LAMP stack, 2005

monitoring => observability

known unknowns => unknown unknowns

LAMP stack => distributed systems

Page 24:

We are all distributed systems engineers now

the unknowns outstrip the knowns

why does this matter more and more?

Page 25:

Distributed systems are particularly hostile to being cloned or imitated (or monitored).

(clients, concurrency, chaotic traffic patterns, edge cases …)

Page 26:

Distributed systems have an infinitely long list of almost-impossible failure scenarios that make staging environments particularly worthless.

this is a black hole for engineering time

Page 27:

Operational literacy is not a nice-to-have

Page 28:

Without observability, you don't have "chaos engineering". You just have chaos.

So what is observability?

Page 29:

Observability is NOT the same as monitoring.

Page 30:

@grepory, Monitorama 2016

“Monitoring is dead.”

“Monitoring systems have not changed significantly in 20 years and has fallen behind the way we build software. Our software is now large distributed systems made up of many non-uniform interacting components while the core functionality of monitoring systems has stagnated.”

Page 31:

Observability

“In control theory, observability is a measure of how well internal states of a system can be inferred from knowledge of its external outputs. The observability and controllability of a system are mathematical duals.” (Wikipedia)

… translate??!?

Page 32:

Observability... for software engineers:

Can you understand what’s happening inside your systems, just by asking questions from the outside? Can you debug your code and its behavior using its output?

Can you answer new questions without shipping new code?
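
To make "new questions without shipping new code" concrete, here is a rough sketch. It assumes raw per-request events are kept around as dictionaries; the field names and the `breakdown` helper are invented for illustration. A new question is then just a new set of arguments, not a new deploy.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical raw events, one per request, with made-up field names.
events = [
    {"endpoint": "/photos", "region": "us-east-1", "build_id": "b41", "duration_ms": 120},
    {"endpoint": "/photos", "region": "us-east-1", "build_id": "b42", "duration_ms": 900},
    {"endpoint": "/photos", "region": "eu-west-1", "build_id": "b42", "duration_ms": 950},
    {"endpoint": "/login", "region": "eu-west-1", "build_id": "b41", "duration_ms": 80},
]

def breakdown(events, by, value="duration_ms"):
    """Group events by any combination of fields and average one numeric field."""
    groups = defaultdict(list)
    for event in events:
        key = tuple(event.get(field) for field in by)
        groups[key].append(event[value])
    return {key: mean(values) for key, values in groups.items()}

# Yesterday's question and today's question use the same stored events.
by_region = breakdown(events, ["region"])
by_build = breakdown(events, ["build_id"])
```

In this toy data the regions look almost identical while the builds do not, which is the kind of thing you only find if you can pick the grouping after the fact.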

Page 33:

Monitoring

Represents the world from the perspective of a third party, and describes the health of the system and/or its components in aggregate.

Observability

Describes the world from the first-person perspective of the software, executing each request. Software explaining itself from the inside out.

Page 34:

We don’t *know* what the questions are, all we have are unreliable symptoms or reports.

Complexity is exploding everywhere, but our tools are designed for a predictable world.

As soon as we know the question, we usually know the answer too.

Page 35:

Welcome to distributed systems.

it’s probably fine.

(it might be fine?)

Page 36:

Many catastrophic states exist at any given time.

Your system is never entirely ‘up’

Page 37:

Distributed systems have an infinitely long list of almost-impossible failure scenarios that make staging environments particularly worthless.

this is a black hole for engineering time

Page 38:

You do it.

You have to do it.

Do it well.

Page 39:

Let’s try some examples!

Can you quickly and reliably track down problems like these?

Page 40:

“Photos are loading slowly for some people. Why?” (old-school LAMP stack)

The app tier capacity is exceeded. Maybe we rolled out a build with a perf regression, or maybe some app instances are down.

DB queries are slower than normal. Maybe we deployed a bad new query, or there is lock contention.

Errors or latency are high. We will look at several dashboards that reflect common root causes, and one of them will show us why.

monitor these things

Page 41:

“Photos are loading slowly for some people. Why?” (microservices)

Any microservice running on c2.4xlarge instances and PIOPS storage in us-east-1b has a 1/20 chance of running on degraded hardware, and will take 20x longer to complete for requests that hit the disk with a blocking call. This disproportionately impacts people looking at older archives due to our fanout model.

Canadian users who are using the French language pack on the iPad running iOS 9 are hitting a firmware condition which makes it fail saving to local cache … which is why it FEELS like photos are loading slowly.

Our newest SDK makes db queries sequentially if the developer has enabled an optional feature flag. Working as intended; the reporters all had debug mode enabled. But the flag should be renamed for clarity's sake.

wtf do i ‘monitor’ for?!

Page 42:

Problems Symptoms (microservices)

“I have twenty microservices and a sharded db and three other data stores across three regions, and everything seems to be getting a little bit slower over the past two weeks but nothing has changed that we know of, and oddly, latency is usually back to the historical norm on Tuesdays.”

“All twenty app microservices have 10% of available nodes enter a simultaneous crash loop cycle, about five times a day, at unpredictable intervals. They have nothing in common afaik and it doesn’t seem to impact the stateful services. It clears up before we can debug it, every time.”

“Our users can compose their own queries that we execute server-side, and we don’t surface it to them when they are accidentally doing full table scans or even multiple full table scans, so they blame us.”

Page 43:

Still More Symptoms

“Several users in Romania and Eastern Europe are complaining that all push notifications have been down for them … for days.”

“Disney is complaining that once in a while, but not always, they don’t see the photo they expected to see — they see someone else’s photo! When they refresh, it’s fixed. Actually, we’ve had a few other people report this too, we just didn’t believe them.”

“Sometimes a bot takes off, or an app is featured on the iTunes store, and it takes us a long long time to track down which app or user is generating disproportionate pressure on shared components of our system (esp databases). It’s different every time.”

“We run a platform, and it’s hard to programmatically distinguish between problems that users are inflicting themselves and problems in our own code, since they all manifest as the same errors or timeouts.”

(microservices)

Page 44:

These are all unknown-unknowns that may have never happened before, or ever happen again

(They are also the overwhelming majority of what you have to care about for the rest of your life.)

Page 45:

Three principles of software ownership:

They who write the code
can and should deploy their code
and watch it run in production.

(**and be on call for it)

Page 46:

When healthy teams with good cultural values and leadership alignment try to adopt software ownership and fail, the cause is usually an observability gap.

Page 47:

Software engineers spend too much time looking at code in elaborately falsified environments, and not enough time observing it in the real world.

Tighten feedback loops. Give developers the observability tooling they need to become fluent in production and to debug their own systems.

We aren’t “writing code”.

We are “building systems”.

Page 48:

Observability for SWEs and the Future™

well-instrumented
high cardinality
high dimensionality
event-driven
structured
well-owned
sampled
tested in prod.

Page 49:

Watch it run in production.

Accept no substitute.

Get used to observing your systems when they AREN’T on fire

Page 50:

Real data
Real users
Real traffic
Real scale
Real concurrency
Real network
Real deploys
Real unpredictabilities.

Page 51:

You care about each and every tree, not the forest.

"The health of the system no longer really matters" -- me

Page 52:

Zero users care what the “system” health is.

All users care about THEIR experience.

Nines don’t matter if users aren’t happy.

Nines don’t matter if users aren’t happy.

Nines don’t matter if users aren’t happy.
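
A small made-up example of how the nines and the user experience can disagree. The request log, the user IDs, and the numbers are invented purely to show the arithmetic: three nines overall, and one user for whom nothing works.

```python
from collections import Counter

# Hypothetical request log of (user_id, succeeded): 9,990 good requests spread
# over 500 users, plus one user whose 10 requests all fail.
requests = [(f"user-{n % 500}", True) for n in range(9_990)]
requests += [("user-unlucky", False)] * 10

# The aggregate "nines" view of the system.
ok = sum(1 for _, succeeded in requests if succeeded)
overall_availability = ok / len(requests)

# The same data, seen from each user's side.
seen = Counter(user for user, _ in requests)
good = Counter(user for user, succeeded in requests if succeeded)
per_user = {user: good[user] / seen[user] for user in seen}
worst_user = min(per_user, key=per_user.get)
```

The aggregate comes out at 99.9% while `per_user[worst_user]` is zero, and only the per-user view tells you who to go apologize to.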

Page 53:

Observability for SWEs and the Future™

well-instrumented
high cardinality
high dimensionality
event-driven
structured
well-owned
sampled
tested in prod.

Page 54:

You win … Drastically fewer paging alerts!

Page 55:

Charity Majors @mipsytipsy

Page 56:

• SREcon

Charity Majors @mipsytipsy

Page 57:
Page 58:
Page 59:

Get inside the software’s head

Explain it back to yourself/a naive user

The right level of abstraction is key

Wrap all network calls, etc

Open up all black boxes to inspection
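
A minimal sketch of what "wrap all network calls" could look like, assuming each request carries a plain dict as its event. `traced_call`, the field naming scheme, and `fetch_photo` are hypothetical stand-ins, not an existing library API.

```python
import time
from contextlib import contextmanager

@contextmanager
def traced_call(event: dict, name: str):
    """Time a wrapped call and record its duration and outcome on the request's event."""
    start = time.perf_counter()
    try:
        yield
        event[f"{name}.error"] = None
    except Exception as exc:
        # Record what went wrong, then let the caller handle it as usual.
        event[f"{name}.error"] = type(exc).__name__
        raise
    finally:
        event[f"{name}.duration_ms"] = (time.perf_counter() - start) * 1000.0

def fetch_photo(photo_id: str) -> bytes:
    # Stand-in for a real network call (hypothetical).
    return b"fake-bytes-for-" + photo_id.encode()

event = {"request_id": "req-123"}
with traced_call(event, "photo_store"):
    photo = fetch_photo("p-42")
```

Every wrapped call adds a couple of fields to the same event, so the black box shows up in the request's own story instead of on a separate dashboard.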

Page 60:

Events tell stories.

Arbitrarily wide events mean you can amass more and more context over time. Use sampling to control costs and bandwidth.

Structure your data at the source to reap massive efficiencies over strings.

(“Logs” are just a transport mechanism for events)
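
Here is one possible shape for this, as a sketch under assumptions rather than a reference implementation: one wide, structured event per request, serialized as JSON, with a simple sampler that records its own rate. The field names, `build_event`, `maybe_emit`, and the "always keep errors" policy are illustrative choices of mine, not something the slides prescribe.

```python
import json
import random

def build_event(request: dict, response: dict, duration_ms: float) -> dict:
    """One wide, structured event per request; add fields as context accumulates."""
    return {
        "request_id": request["id"],
        "user_id": request["user_id"],
        "endpoint": request["endpoint"],
        "build_id": request["build_id"],
        "status": response["status"],
        "duration_ms": duration_ms,
    }

def maybe_emit(event: dict, sample_rate: int, rng: random.Random, sink: list) -> None:
    """Keep every server error; keep roughly 1 in `sample_rate` of everything else.

    The rate is stored on the event so counts can be re-weighted at query time.
    """
    rate = 1 if event["status"] >= 500 else sample_rate
    if rng.randrange(rate) == 0:
        sink.append(json.dumps({**event, "sample_rate": rate}))

rng = random.Random(0)
sink = []  # stands in for whatever transport carries the events
for n in range(1000):
    request = {"id": f"req-{n}", "user_id": f"user-{n % 7}",
               "endpoint": "/photos", "build_id": "b42"}
    response = {"status": 500 if n % 100 == 0 else 200}
    maybe_emit(build_event(request, response, 12.5), 10, rng, sink)

kept = [json.loads(line) for line in sink]
estimated_total = sum(e["sample_rate"] for e in kept)  # approximates the 1000 requests
```

Because each stored event carries its `sample_rate`, summing the rates gives an estimate of the original traffic even though most of it was dropped.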

Page 61:

High cardinality will save your ass.

UUIDs
db raw queries
normalized queries
comments
firstname, lastname
PID/PPID
app ID
device ID
HTTP header type
build ID
IP:port
shopping cart ID
userid
... etc

Page 62:

You must be able to break down by 1/millions and THEN by anything/everything else

High cardinality is not a nice-to-have

‘Platform problems’ are now everybody’s problems
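
As a toy illustration of "1/millions and THEN anything else": filter down to a single value of a high-cardinality field, then break the result down by whatever other field looks interesting. The events, field names, and `break_down` helper are all hypothetical.

```python
from collections import Counter

def break_down(events, where: dict, by: str) -> Counter:
    """Filter to events matching every field in `where`, then count by `by`."""
    matching = (e for e in events if all(e.get(k) == v for k, v in where.items()))
    return Counter(e.get(by) for e in matching)

# Hypothetical events: user_id has 100,000 distinct values (high cardinality).
events = [
    {"user_id": f"user-{n}", "endpoint": "/photos", "status": 200}
    for n in range(100_000)
]
# One specific user is quietly timing out on one endpoint.
events += [{"user_id": "user-31337", "endpoint": "/push", "status": 504}] * 3

one_user = break_down(events, where={"user_id": "user-31337"}, by="endpoint")
one_user_errors = break_down(
    events, where={"user_id": "user-31337", "status": 504}, by="endpoint"
)
```

Three failing requests out of 100,003 would vanish in any aggregate, but they are the whole story once you can slice by the one user who is complaining.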

Page 63:

Jumps to an answer, instead of starting with a question

You don’t know what you don’t know.

Artifacts of past failures.

Page 64:

Aggregation is a one-way trip

Destroying raw events eliminates your ability to ask new questions.

Forever.
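
A tiny worked example of why the trip is one-way, using invented latency numbers: two very different minutes of traffic collapse to the same average, and once only the average is stored, the question "did anyone wait more than a second?" can no longer be answered.

```python
from statistics import mean

# Two hypothetical minutes of request latencies, in milliseconds.
steady = [100] * 100            # everyone waits 100 ms
spiky = [50] * 99 + [5050]      # almost everyone is fast, one request takes 5 s

# What a pre-aggregated metric would keep: a single number per minute.
steady_avg = mean(steady)
spiky_avg = mean(spiky)

# What only the raw events can still tell you.
slow_in_steady = sum(1 for v in steady if v > 1000)
slow_in_spiky = sum(1 for v in spiky if v > 1000)
```

Both averages are 100 ms, so the aggregate cannot tell these minutes apart; the raw events can.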