it's all about telemetry

It’s all about telemetry Monitoring what matters in a useful way. Tuesday, June 26, 12

Upload: theo-schlossnagle

Post on 16-May-2015


Category:

Technology



DESCRIPTION

The ins and outs of monitoring your technology enabled business.

TRANSCRIPT

Page 1: It's all about telemetry

It’s all about telemetry

Monitoring what matters in a useful way.

Tuesday, June 26, 12

Page 2: It's all about telemetry

Theo Schlossnagle @postwait

I write software

I write books

I give talks

I participate in the industry

I speak frankly about industry issues


Page 3: It's all about telemetry

Data, data, everywhere.

A billion pageviews / month.

100k database queries / second.

1MM memcache queries / second.

500k MQ messages / second.

10MM I/O operations / second.


Page 4: It's all about telemetry

Most new big data problems are

solvable

Big Data


Page 5: It's all about telemetry

Most new big data problems are created by our solutions, and thus solvable despite their ROI

Big Data


Page 6: It's all about telemetry

That’s a whole lot of data

Think in terms of logs (too many do)

About 26 trillion log lines / month

@ 40 bytes compressed: 1PB / month

Just because it is possible does not mean it will return on investment (and does not mean it won’t)
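The slide does not say where the 26 trillion comes from. One reading that reproduces it, offered here only as an assumption, is one log line for each of the 10MM I/O operations per second over a 30-day month. The 40-byte line size is from the slide; decimal petabytes (10^15 bytes) are assumed.

```python
# Assumptions: one log line per I/O operation, 30-day month, decimal petabytes.
IO_OPS_PER_SECOND = 10_000_000          # "10MM I/O operations / second"
SECONDS_PER_MONTH = 30 * 24 * 60 * 60
BYTES_PER_LINE = 40                     # "@ 40 bytes compressed"

log_lines_per_month = IO_OPS_PER_SECOND * SECONDS_PER_MONTH
petabytes_per_month = log_lines_per_month * BYTES_PER_LINE / 10**15

print(f"{log_lines_per_month / 1e12:.1f} trillion lines / month")
print(f"{petabytes_per_month:.2f} PB / month")
```

Under those assumptions the result is roughly 25.9 trillion lines and 1.04 PB per month, which matches the slide's rounded figures.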


Page 7: It's all about telemetry

It’s all “useful”; which data?

Think in terms of cost/benefit.

Sure the data is useful, but it costs money to store

Does it cost you more to have it or not to have it?

Maybe the right approach is to keep that level of detail for a few days?


Page 8: It's all about telemetry

Double-edged sword.

Eroding granularity over time keeps storage under control


Page 9: It's all about telemetry

Double-edged sword.

Eroding granularity over time keeps storage under control

MISTAKE
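The transcript does not spell out why this is a mistake. One plausible reading is that averaging over longer windows erases short anomalies. The sketch below uses made-up latency data (all values are hypothetical) to show a five-minute spike fading as granularity erodes.

```python
from statistics import fmean

# Hypothetical data: one day of per-minute latency readings (ms), flat at
# 10 ms except for a 5-minute spike to 500 ms.
minutes_per_day = 24 * 60
per_minute = [10.0] * minutes_per_day
per_minute[600:605] = [500.0] * 5

def downsample(values, factor):
    """Erode granularity by averaging each run of `factor` samples."""
    return [fmean(values[i:i + factor]) for i in range(0, len(values), factor)]

hourly = downsample(per_minute, 60)
daily = downsample(per_minute, minutes_per_day)

print(max(per_minute))  # the spike is obvious at 1-minute granularity
print(max(hourly))      # much smaller once averaged over an hour
print(daily[0])         # close to what a normal day would show
```

With this data the peak drops from 500 ms to about 51 ms in the hourly series and to under 12 ms in the daily one.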


Page 10: It's all about telemetry

1 year

at a glance


Page 11: It's all about telemetry

1 week

looks normalish


Page 12: It's all about telemetry

1 day

confidence of normalcy increases


Page 13: It's all about telemetry

1 week

that looks different


Page 14: It's all about telemetry

1 day

yup, that’s not at all like that other week


Page 15: It's all about telemetry

Other methods

What do you store?

How do you store it?

Why is it useful?

Winning the cost-benefit game by reducing costs more significantly than reducing benefits


Page 16: It's all about telemetry

Positive Value

Be in the green.

[Chart: Benefit and Cost plotted against monitoring activity]


Page 17: It's all about telemetry

There’s a bigger picture

It’s not as easy as you think.

[Chart: Benefit and Cost plotted against monitoring activity, over a wider range]


Page 18: It's all about telemetry

Value is difference, not area

Green can be misleading

[Chart: Benefit and Cost plotted against monitoring activity]


Page 19: It's all about telemetry

Value = Benefit - Cost

Green means we have positive return

[Chart: Value plotted against monitoring activity]


Page 20: It's all about telemetry

It’s not about return

Well, it’s not only about return

[Chart: curve plotted against monitoring activity]


Page 21: It's all about telemetry

It’s about maximizing return

This is a bit like black magic

[Chart: curve plotted against monitoring activity]
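The curves themselves are not in the transcript, so the sketch below invents two (both functions are hypothetical) to illustrate the distinction the slides draw: value is benefit minus cost, and the level of monitoring activity that maximizes value is well short of the level where value merely stops being positive.

```python
import math

# Hypothetical curves: the shapes on the slides are not in the transcript.
def benefit(activity):
    return 1.0 - math.exp(-2.0 * activity)   # diminishing returns

def cost(activity):
    return 0.25 * activity ** 2              # grows faster as activity rises

def value(activity):
    return benefit(activity) - cost(activity)

# Grid search over the 0..3 range used on the slide axes.
grid = [i / 100 for i in range(301)]
best = max(grid, key=value)
break_even = max(a for a in grid if value(a) >= 0)

print(f"value peaks near activity={best:.2f} (value={value(best):.2f})")
print(f"value stays non-negative up to activity={break_even:.2f}")
```

A grid search is used only because it is easy to read; with real cost and benefit data the hard part is estimating the curves, not finding the maximum.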


Page 22: It's all about telemetry

Technique 1: text

Store changes
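A minimal sketch of storing only changes for a text metric. The class and method names are hypothetical, not taken from any tool named in the talk.

```python
class ChangeLog:
    """Keep a text metric's history as (timestamp, value) change points."""

    def __init__(self):
        self.changes = []

    def observe(self, timestamp, value):
        # Record only when the value differs from the last stored one.
        if not self.changes or self.changes[-1][1] != value:
            self.changes.append((timestamp, value))

    def value_at(self, timestamp):
        """Return the value in effect at `timestamp`, or None before the first change."""
        current = None
        for ts, value in self.changes:
            if ts > timestamp:
                break
            current = value
        return current

# Hypothetical example: a software version string polled once a minute.
log = ChangeLog()
for minute in range(1000):
    log.observe(minute * 60, "1.4.2" if minute < 700 else "1.4.3")

print(len(log.changes))       # 2 stored records for 1000 observations
print(log.value_at(30_000))   # 1.4.2
```

The full history can still be reconstructed at any point in time, because a value is assumed to hold until the next recorded change.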


Page 23: It's all about telemetry

Technique 2: numeric

Store rollups (i.e. statistical aggregates over fixed windows)

over 1 minute store

min/max/avg/stddev/covariance/50%/95%/99%

lots of information

heavy lossy compression of high-frequency data

loses population distribution information
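A sketch of a one-minute rollup using only the Python standard library (3.8 or later). The function name and record layout are hypothetical. Covariance is left out because it needs a second series to compare against.

```python
import statistics
from collections import defaultdict

def rollup(samples, window=60):
    """Aggregate (timestamp, value) pairs into fixed windows of `window` seconds."""
    buckets = defaultdict(list)
    for ts, value in samples:
        buckets[int(ts // window) * window].append(value)

    out = {}
    for start, values in sorted(buckets.items()):
        if len(values) > 1:
            cuts = statistics.quantiles(values, n=100, method="inclusive")
        else:
            cuts = [values[0]] * 99   # a single sample is its own percentile
        out[start] = {
            "count": len(values),
            "min": min(values),
            "max": max(values),
            "avg": statistics.fmean(values),
            "stddev": statistics.pstdev(values),
            "p50": cuts[49],
            "p95": cuts[94],
            "p99": cuts[98],
        }
    return out

samples = [(0, 1.0), (10, 2.0), (20, 3.0), (61, 10.0)]
for start, stats in rollup(samples).items():
    print(start, stats)
```

Each window collapses to a fixed-size record no matter how many samples arrived, which is where both the compression and the loss of distribution detail come from.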


Page 24: It's all about telemetry

Database replication

Lag (green) and rate of lag change (purple)
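The graph is not in the transcript. As a sketch of how a rate-of-change series can be derived from a lag series, assuming samples arrive as (timestamp, lag) pairs (the function name and data are hypothetical):

```python
def rate_of_change(samples):
    """Per-second change between consecutive (timestamp, value) samples."""
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(samples, samples[1:])
        if t1 > t0   # skip duplicate or out-of-order timestamps
    ]

# Hypothetical replication lag readings: (seconds since start, lag in seconds).
lag = [(0, 2.0), (60, 2.0), (120, 8.0), (180, 5.0)]
print(rate_of_change(lag))
```

A positive rate means the replica is falling further behind; a negative rate means it is catching up.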


Page 25: It's all about telemetry

Storage Usage

We can see growth. More useful, we can use this to project.


Page 26: It's all about telemetry

Storage Usage

We can see growth. More useful, we can use this to project.
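The slide does not say how the projection is made. A straight-line least-squares fit is one simple option, sketched here on made-up usage figures (the function names, the data, and the assumption of linear growth are all hypothetical):

```python
def fit_line(points):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def day_capacity_reached(points, capacity):
    """Project the day on which usage reaches `capacity`; None if not growing."""
    slope, intercept = fit_line(points)
    if slope <= 0:
        return None
    return (capacity - intercept) / slope

# Hypothetical usage: (day, terabytes used), growing 0.5 TB per day.
usage = [(0, 40.0), (1, 40.5), (2, 41.0), (3, 41.5), (4, 42.0)]
print(day_capacity_reached(usage, capacity=100.0))
```

With this data the fit projects that 100 TB is reached on day 120. Real usage is rarely this clean, so any projection of this kind deserves a confidence range rather than a single number.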


Page 27: It's all about telemetry

With simple numeric data


Page 28: It's all about telemetry

With simple numeric data

Unknowns can be predicted


Page 29: It's all about telemetry

With simple numeric data

In sane ways with confidence


Page 30: It's all about telemetry

Full Disclosure

You see awesome examples of predictive analytics

Like the real-world one on the previous slide

In practice, almost all data streams predict one thing:

they have no fucking clue.


Page 31: It's all about telemetry

Technique 3: histograms

Store histograms

over 1 minute store

counts of datapoints seen in various buckets

retains complete population distribution

loss of precision
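A sketch of per-minute histogram storage. The fixed 5 ms bucket width is an assumption chosen for readability; the talk does not say how its buckets are laid out.

```python
from collections import Counter, defaultdict

BUCKET_WIDTH_MS = 5   # hypothetical; coarser buckets cost less and lose more precision
WINDOW_SECONDS = 60   # "over 1 minute store"

def histograms(samples):
    """Turn (timestamp, latency_ms) samples into one bucket-count table per window."""
    windows = defaultdict(Counter)
    for ts, latency in samples:
        window_start = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
        bucket_floor = int(latency // BUCKET_WIDTH_MS) * BUCKET_WIDTH_MS
        windows[window_start][bucket_floor] += 1
    return windows

samples = [(1, 3.2), (2, 4.9), (15, 7.1), (30, 48.0), (61, 6.0)]
for start, counts in sorted(histograms(samples).items()):
    print(start, sorted(counts.items()))
```

Every sample is still counted, so the shape of the distribution survives; what is lost is where exactly within its bucket each value fell.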


Page 32: It's all about telemetry

Histograms 101

This.

This is a histogram.

It shows the frequency of values within a population.

Height represents frequency


Page 33: It's all about telemetry

Histograms 101

This.

This is a histogram.

It shows the frequency of values within a population.

Now, height and color represent frequency


Page 34: It's all about telemetry

This.

This is a histogram.

It shows the frequency of values within a population.

Now, only color represents frequency

Histograms 101


Page 35: It's all about telemetry

This.

This is a histogram.

It shows the frequency of values within a population.

Now, only color represents frequency

Histograms 101


Page 36: It's all about telemetry

This.

This is a histogram.

It shows the frequency of values within a population.

Now, only color represents frequency

Histograms ➠ time series


Page 37: It's all about telemetry

This.

This is a histogram.

It shows the frequency of values within a population.

Now, only color represents frequency

Histograms ➠ time series


Page 38: It's all about telemetry

This.

This is a histogram.

It shows the frequency of values within a population.

Now, only color represents frequency

Histograms ➠ time series

at a single time interval


Page 39: It's all about telemetry

API Service Times

We can see a full population shift of several milliseconds
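The heat map is not in the transcript. One way to check numerically for a shift of the whole population is to read several percentiles off the stored histograms and see whether they all move together. The sketch below does that on invented bucket counts (the function name, bucket layout, and data are hypothetical).

```python
def percentile_from_histogram(counts, percent):
    """Estimate a percentile (1-100) from {bucket_floor: count}.

    Returns the floor of the bucket holding that percentile, so the answer
    is only as precise as the bucket width.
    """
    total = sum(counts.values())
    seen = 0
    for bucket_floor in sorted(counts):
        seen += counts[bucket_floor]
        if seen * 100 >= percent * total:   # integer math avoids float rounding
            return bucket_floor
    raise ValueError("empty histogram")

# Hypothetical per-interval histograms of API service time (ms), before and
# after the whole population moves by a few milliseconds.
before = {10: 50, 12: 400, 14: 500, 16: 50}
after = {14: 50, 16: 400, 18: 500, 20: 50}

for percent in (5, 50, 95):
    print(percent,
          percentile_from_histogram(before, percent),
          percentile_from_histogram(after, percent))
```

Here every percentile moves by the same 4 ms. A change that affected only slow requests would move the high percentiles and leave the low ones alone, which a single average cannot distinguish.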


Page 40: It's all about telemetry

Combining techniques

In our system (as a reference point)

Arbitrary numbers of numeric data points on a single stream occupy 32 bytes of space for statistical aggregates and about 2k of space for a histogram

This means we can store these transforms on numeric data in perpetuity
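A back-of-the-envelope check of the "in perpetuity" claim. The slide does not state the window these sizes apply to; the sketch assumes one aggregate record and one histogram per one-minute window (the window used on the technique slides), reads "2k" as 2048 bytes, and uses a 365-day year.

```python
# Assumptions: one aggregate record and one histogram per 1-minute window,
# "2k" read as 2048 bytes, 365-day year.
AGGREGATE_BYTES = 32
HISTOGRAM_BYTES = 2048
WINDOWS_PER_YEAR = 365 * 24 * 60

bytes_per_stream_year = WINDOWS_PER_YEAR * (AGGREGATE_BYTES + HISTOGRAM_BYTES)
print(f"{bytes_per_stream_year / 10**9:.2f} GB per stream per year")
```

Under those assumptions a stream costs about 1.09 GB per year regardless of how many data points it receives, almost all of it histogram storage.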


Page 41: It's all about telemetry

Combining techniques

Text is a bit harder

You need to be careful

Some data sources can be constantly changing

Producing gobs of change data

You’re doing it wrong

Find these and fix them


Page 42: It's all about telemetry

Correlating Events

Change Management vs. Performance


Page 43: It's all about telemetry

Correlating Events

Change Management vs. Performance


Page 44: It's all about telemetry

What to monitor?

Most people don’t monitor the things that matter most


Page 45: It's all about telemetry

Monitor the Business

Financials:

Revenues. Costs. Margins. AR. Account delinquency.

Marketing:

Web analytics. Campaigns. Costs. Returns. Convergence.


Page 46: It's all about telemetry

Monitor the Support

Customer Service:

Problems. Time investment. Customer satisfaction. Resolution time.


Page 47: It's all about telemetry

Monitor the Engineering

Engineering:

Deployments. Test coverage. Bug reports. Bug fixes. Effort spent.

Operations:

Faults. Pages. Escalations. Provisioning time. Equipment defect rates. 3rd party failure rates.


Page 48: It's all about telemetry

Monitor the Service

Systems:

Networks. Systems. Storage.

Databases:

Performance. Error rates. Backups.

Middleware:

Herein lies the magic and room for awesomeness


Page 49: It's all about telemetry

Monitor the Middleware

Your systems are complex

Monitor their interactions

Messaging, APIs, etc.


Page 50: It's all about telemetry

Monitor all the things.

But, perhaps most importantly...


Page 51: It's all about telemetry

Monitor all the things.

But, perhaps most importantly...

USE UNIFIED TOOLING


Page 52: It's all about telemetry

What we use...

reconnoiter

SNMP, nad, resmon, statsd, HTTP traps, jdbc, etc.

statsd (clients)

javascript beacons
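For readers who have not used statsd, a minimal client sketch using only the standard library. It follows the common statsd line format `<name>:<value>|<type>` sent over UDP, with 8125 as the usual default port. The metric names and the local address are hypothetical, and this is not the speaker's client code.

```python
import socket

def format_metric(name, value, metric_type):
    """Build a statsd datagram: <name>:<value>|<type> (c = counter, ms = timer, g = gauge)."""
    return f"{name}:{value}|{metric_type}".encode("ascii")

def send_metric(name, value, metric_type, host="127.0.0.1", port=8125):
    """Fire-and-forget a single metric over UDP."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(format_metric(name, value, metric_type), (host, port))

# Hypothetical metric names; nothing is sent until send_metric is called.
print(format_metric("api.signup.count", 1, "c"))
print(format_metric("api.service_time", 12, "ms"))
```

Because UDP is connectionless, instrumented code does not block or fail when the collector is down, which is part of why this style of client is cheap to sprinkle through middleware.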


Page 53: It's all about telemetry

Middleware mix

API service times, traffic, user signup rates.


Page 54: It's all about telemetry


Page 55: It's all about telemetry

Thank you!
