(DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument


Upload: amazon-web-services

Post on 15-Apr-2017


TRANSCRIPT

Page 1: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

DVO205

The AdRoll Monitoring Evolution: From Flying Blind to Flying by Instrument

Brian Troutwine, AdRoll

Ilan Rabinovitch, Datadog

October 2015

Page 2: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Today’s speakers

Ilan Rabinovitch

Dir. Technical Community

Datadog

Brian Troutwine

Sr. Software Engineer

AdRoll

Page 3: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Quick Overview of

Datadog

• Monitoring for modern applications

• Dynamic Infrastructure

• Microservices

• Time series storage of metrics and events

• 100s of built-in integrations

• E.g., EC2, ELB, ECS, and more

Page 4: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

CAMS

Culture

Automation

Metrics

Sharing

Damon Edwards and John Willis

Page 5: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

CAMS

Culture

Automation

METRICS

SHARING

Page 6: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

You’re in the cloud and it's everything you dreamed of!

Page 7: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

You’re in the cloud and it's everything you dreamed of!

Autoscaling

Container orchestration

Infinite storage

Page 8: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

In cloud we trust.

But how do we verify health?

Page 9: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

If it moves, monitor it.

Page 10: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

How does your current monitoring fit in?

Page 11: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument
Page 12: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

• Host-centric

How does our current monitoring fit in?

Page 13: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

• Host-centric

• Static configurations tracking dynamic infrastructure

How does our current monitoring fit in?

Page 14: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

• Host-centric

• Static configurations tracking dynamic infrastructure

• Focused on resources, rather than work

How does our current monitoring fit in?

Page 15: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

• Host-centric

• Static configurations tracking dynamic infrastructure

• Focused on resources, rather than work

• Difficult to pull together and compare data from multiple sources

How does our current monitoring fit in?

Page 16: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

So what to monitor?

More at: http://goo.gl/t1Rgcg

Page 17: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

How to use that data?

More at: http://goo.gl/t1Rgcg

Page 18: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Recurse until you find root cause

Page 19: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Query-based monitoring

• Aggregates matter because the underlying infrastructure is dynamic

• Express our monitors or alerts as queries on predicates:

• “avg response time for requests to hosts running nginx > 500 ms”

• “min # of hosts running nginx < 3”

• Mash up data sources for a 360-degree view of a problem
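The two example monitors above can be sketched as queries over tagged samples. This is a minimal sketch and not Datadog's query language: the `Sample` type, the `role:nginx` tag format, and the sample data are all hypothetical; only the two thresholds (500 ms, 3 hosts) come from the slide.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Sample:
    host: str
    tags: frozenset  # hypothetical tag format, e.g. {"role:nginx"}
    response_ms: float


def hosts_with(samples, tag):
    """Return the set of hosts reporting at least one sample with `tag`."""
    return {s.host for s in samples if tag in s.tags}


def avg_response_ms(samples, tag):
    """Average response time across every sample carrying `tag`."""
    values = [s.response_ms for s in samples if tag in s.tags]
    return mean(values) if values else None


def evaluate_monitors(samples):
    """Evaluate the two example monitors; True means the alert fires."""
    avg = avg_response_ms(samples, "role:nginx")
    return {
        "nginx_slow": avg is not None and avg > 500,
        "nginx_too_few_hosts": len(hosts_with(samples, "role:nginx")) < 3,
    }


# Made-up data: two nginx hosts and one unrelated database host.
samples = [
    Sample("web-1", frozenset({"role:nginx"}), 420.0),
    Sample("web-2", frozenset({"role:nginx"}), 640.0),
    Sample("db-1", frozenset({"role:postgres"}), 12.0),
]
alerts = evaluate_monitors(samples)
```

Because the monitors select hosts by tag rather than by name, hosts can come and go without anyone editing the alert definitions.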

Page 20: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Query-based monitoring

“Show me iowait across nginx hosts, grouped by availability zone”
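A query like this one amounts to a filter followed by a group-by. The sketch below assumes a hypothetical list of `(host, role, zone, iowait)` readings with made-up values; a real monitoring service would run the equivalent query server-side.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (host, role, availability zone, iowait %) readings.
readings = [
    ("web-1", "nginx", "us-east-1a", 2.0),
    ("web-2", "nginx", "us-east-1a", 4.0),
    ("web-3", "nginx", "us-east-1b", 9.0),
    ("db-1", "postgres", "us-east-1b", 30.0),
]


def iowait_by_zone(readings, role):
    """Average iowait for hosts with the given role, grouped by zone."""
    by_zone = defaultdict(list)
    for _host, host_role, zone, iowait in readings:
        if host_role == role:
            by_zone[zone].append(iowait)
    return {zone: mean(values) for zone, values in by_zone.items()}


nginx_iowait = iowait_by_zone(readings, "nginx")
```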

Page 21: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument
Page 22: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument
Page 23: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Real-time

bidding

Page 24: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

The problem domain

• Low latency (< 100 ms per transaction)

• Firm real-time system

• Highly concurrent (~2 million transactions per second, peak)

• Global, 24/7 operation

Page 25: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

In the early days of the

AdRoll real-time bidding

(RTB) project, we could

use our intuition.

Page 26: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

• The system was simple.

• The number of total requests was small.

• The impact of mistakes was minor.

Page 27: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

We could be reasonably

confident that our mental

model of the system’s

behavior was accurate.

Page 28: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

The trouble with a

complex system is that its

behavior in practice gets

away from you pretty fast.

Page 29: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Our first approach was

to batch process logs

generated by

individual bidders.

Batch processing

Page 30: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Pros:

• We were already doing this.

• It’s simple to implement.

• It’s straightforward to

conceive.

Batch processing

Page 31: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Cons:

• High update latency

• Catastrophic errors lose logs

• Denies impulse

experimentation

Batch processing

Page 32: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Our second approach

was to generate coarse

real-time metrics and

analyze those.

Coarse real-time metrics

Page 33: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Pros:

• Iterative step up from

batch processing

• Proves out the concept

• Simple to implement

Coarse real-time metrics

Page 34: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Cons:

• Still relied on intuition

• Bidder implementation was

sub-optimal

• Dashboards took a one-size-fits-all approach

Coarse real-time metrics

Page 35: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

By this point, the complexity of the

system and our ambitions were

growing.

• Two engineers were added to the

team.

• Tens more in the department.

• RTB became a central project.

Page 36: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

We were making

decisions in a

knowledge void.

Page 37: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

At this point, we have

AWS CloudWatch.

Page 38: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

CloudWatch

reports the basic

health of your

system.

Page 39: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

CloudWatch provides the total view of the AWS services you’re using.

Page 40: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

What we don’t have at

this point is a detailed

view of our system.

Page 41: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

What we don’t have is

the ability to explore the

information we have,

especially in high-stress situations.

Page 42: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Exometer solves our

Erlang-side problem.

Detailed application-level instrumentation is

cheap and easy.

Page 43: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Datadog solves

our aggregation,

visualization, and

alerting problems.

Page 44: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Datadog integrates with

CloudWatch. Our system-specific metrics can be

correlated with the basic

health of the system.

Page 45: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

This can be done in real time.

Correlation of system information

Page 46: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

This can be done in

high-stress situations.

Correlation of system information

Page 47: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

This can be done by

other departments of

the business.

Correlation of system information

Page 48: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

A bid “times out”

when we don’t reply

back to the exchange

in 100 ms.

Timeout spikes
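As a rough illustration of the metric behind this section, a timeout rate can be derived from per-bid response latencies. This is a sketch, not AdRoll's bidder code: the function name and the sample latencies are hypothetical, and treating "slower than 100 ms" as a timeout is an assumption based on the definition above.

```python
DEADLINE_MS = 100  # exchange deadline stated on the slide


def timeout_rate(latencies_ms, deadline_ms=DEADLINE_MS):
    """Fraction of bid responses that took longer than the deadline."""
    if not latencies_ms:
        return 0.0
    late = sum(1 for latency in latencies_ms if latency > deadline_ms)
    return late / len(latencies_ms)


# Made-up latencies for four bids; two of them miss the deadline.
rate = timeout_rate([20, 80, 120, 150])
```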

Page 49: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Timeout spikes

Page 50: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

We didn’t realize this

was happening. It’s an

early win of our

sophisticated monitoring.

Timeout spikes

Page 51: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Timeout spikes

Page 52: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

System load is normal.

There’s not a periodic

spike in bid request traffic.

Timeout spikes

Page 53: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Timeout spikes

Page 54: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

There is a correlated

jump in network

traffic, however.

Timeout spikes

Page 55: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Timeout spikes

Page 56: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

There were also

correlated spikes in

the Erlang VM’s

process run queue.

Timeout spikes
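Once timeouts and run-queue length are in the same place, "correlated spikes" can be checked numerically as well as by eye. The sketch below computes a Pearson correlation coefficient over two made-up per-minute series; the numbers are illustrative and are not AdRoll data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


# Hypothetical per-minute samples with spikes at the same two minutes.
timeouts = [1, 2, 1, 40, 2, 1, 38, 1]
run_queue = [3, 4, 3, 90, 4, 3, 85, 3]
r = pearson(timeouts, run_queue)
```

A coefficient near 1 only says that the two series move together; finding the cause still takes the kind of investigation the next slide summarizes.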

Page 57: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

• VM scheduler threads are locked to CPU

• CPU-intensive background process kicks

on every 20 minutes

• No CPU shield on the server

• VM scheduler thread gets kicked from its

assigned CPU, processes back up

Timeout spikes

Page 58: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Failure of bid-request traffic is an

all-hands problem.

Traffic crash

Page 59: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Without traffic, the

bidders can do nothing.

Traffic crash

Page 60: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Traffic crash

Page 61: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

That’s a healthy couple

of days' worth of traffic.

It dips in the night, and

climbs in the day.

Page 62: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Traffic crash

Page 63: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

This is a weekend’s

worth of traffic lost.

Traffic crash

Page 64: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

• Confirmed with CloudWatch that

networking to the machines was fine

• No changes had been made to the

production system (it was a looser

time)

• All detailed metrics from the Erlang VM were acceptable

Traffic crash

Page 65: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Traffic crash

Page 66: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

The exchange

confirmed a drop in

traffic from their system.

Traffic crash

Page 67: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Turns out, we hit an

implicit exchange

limitation.

Traffic crash

Page 68: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

We also became more

conscientious about

alerting effectively.

Traffic crash

Page 69: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

At high scale, it’s very

easy to be

over-provisioned for

the system’s load.

Sophisticated autoscaling

Page 70: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Worse, it’s very easy to

be under-provisioned

for system load.

Sophisticated autoscaling

Page 71: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

All CloudWatch alarms

on EC2 instances can

be pressed into service

for autoscaling.

Sophisticated autoscaling

Page 72: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Our first autoscaling

approach used

remaining idle CPU.

Sophisticated autoscaling

Page 73: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Sophisticated autoscaling

Page 74: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Sophisticated autoscaling

As traffic drops off at

the end of the day,

we need less CPU

time to process it.

Page 75: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Sophisticated autoscaling

Page 76: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

This was great! We

immediately saved

loads of money.

Sophisticated autoscaling

Page 77: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Problem was, it’s an

indirect measurement.

There’s always some

nuance you’ll miss.

Sophisticated autoscaling

Page 78: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Co-resident subsystems

eat into the CPU time,

giving an inaccurate

impression.

Sophisticated autoscaling

Page 79: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

CPU consumption

carries no information

about aberrant

system issues.

Sophisticated autoscaling

Page 80: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

What can be done?

Sophisticated autoscaling

Page 81: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Distill the performance

capability of your

system into a single

signal.

Sophisticated autoscaling

Page 82: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Sophisticated autoscaling

Page 83: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Sophisticated autoscaling

The "metadata index"

tracks the load on the

bidders. It’s a weighted

sum of key metrics.
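A weighted sum of key metrics can be sketched as follows. The slides do not list which metrics or weights AdRoll used, so the metric names, the weights, and the assumption that each input is pre-scaled to the range 0..1 are all hypothetical.

```python
# Hypothetical metric names and weights; the slides do not give the real ones.
WEIGHTS = {
    "run_queue_length": 0.5,
    "timeout_rate": 0.3,
    "requests_per_second": 0.2,
}


def metadata_index(normalized_metrics, weights=WEIGHTS):
    """Weighted sum of metrics that are already scaled to 0..1."""
    missing = set(weights) - set(normalized_metrics)
    if missing:
        raise KeyError(f"missing metrics: {sorted(missing)}")
    return sum(weights[name] * normalized_metrics[name] for name in weights)


# Made-up, pre-normalized readings from one bidder.
index = metadata_index({
    "run_queue_length": 0.4,
    "timeout_rate": 0.1,
    "requests_per_second": 0.5,
})
```

Normalizing the inputs first keeps one metric with a large numeric range from dominating the sum.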

Page 84: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Sophisticated autoscaling

As traffic drops, the

metadata index

drops. Indirectly, idle

CPU increases.

Page 85: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Sophisticated autoscaling

We emit this

metadata index into

CloudWatch as a

custom metric.
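Publishing a value as a CloudWatch custom metric could look roughly like this with boto3's `put_metric_data`. The namespace, metric name, and dimension value are hypothetical placeholders, and `publish` needs the third-party `boto3` package plus AWS credentials and a region, so only the request-building part is exercised here.

```python
from datetime import datetime, timezone


def build_metric_request(value, namespace="Example/Bidder",
                         metric_name="MetadataIndex", group="bidders"):
    """Build keyword arguments for CloudWatch put_metric_data.

    The namespace, metric name, and group name are hypothetical.
    """
    return {
        "Namespace": namespace,
        "MetricData": [{
            "MetricName": metric_name,
            "Dimensions": [
                {"Name": "AutoScalingGroupName", "Value": group},
            ],
            "Timestamp": datetime.now(timezone.utc),
            "Value": float(value),
            "Unit": "None",
        }],
    }


def publish(value):
    """Send the metric to CloudWatch (requires boto3 and AWS credentials)."""
    import boto3  # imported lazily so the sketch loads without the SDK
    client = boto3.client("cloudwatch")
    client.put_metric_data(**build_metric_request(value))


request = build_metric_request(0.33)
```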

Page 86: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Sophisticated autoscaling

As soon as it hits

CloudWatch, you

can autoscale on it.
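In AWS this is configured as CloudWatch alarms attached to Auto Scaling policies rather than written as application code, but the decision such a policy makes can be sketched locally. The thresholds, step size, and group limits below are hypothetical and are not taken from the talk.

```python
def desired_capacity(current, index, high=0.7, low=0.3,
                     min_size=2, max_size=100):
    """Step the instance count up or down based on the load index.

    All thresholds and limits are illustrative defaults.
    """
    if index > high:
        target = current + 1   # load is high: add an instance
    elif index < low:
        target = current - 1   # load is low: remove an instance
    else:
        target = current       # inside the comfortable band: hold
    return max(min_size, min(max_size, target))
```

For example, `desired_capacity(10, 0.9)` asks for 11 instances, while `desired_capacity(2, 0.1)` stays at the minimum of 2.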

Page 87: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Sophisticated autoscaling

Page 88: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Sophisticated autoscaling

Page 89: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Sophisticated autoscaling

Page 90: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Sophisticated autoscaling

This is twice as efficient

as the CPU idle scaling

signal. One-half the

number of machines.

Page 91: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Sophisticated autoscaling

Page 92: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

There’s a lot of fraud in

the online advertising

industry.

Anti-fraud CookieBouncer

Page 93: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

A certain kind of “hot

cookie” fraud

caused a tolerable

fault in the bidders.

Anti-fraud CookieBouncer

Page 94: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Anti-fraud CookieBouncer

Page 95: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

CookieBouncer

blocks bidding on

fraudulent, “hot,”

cookies in real time.

Anti-fraud CookieBouncer

Page 96: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Our concern was

blocking too much traffic,

in turn blocking

legitimate bids through

over-aggressive tuning.

Anti-fraud CookieBouncer

Page 97: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

We built a new

CookieBouncer

dashboard and introduced

the ability to tune it in real

time on every bidder.

Anti-fraud CookieBouncer
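The slides do not describe how CookieBouncer decides that a cookie is "hot", so the following is only a hypothetical sketch of one way to do it: a per-cookie sliding-window counter with a threshold that can be changed at runtime, plus the blocked-traffic fraction a dashboard might chart. The class name, parameters, and defaults are all assumptions.

```python
from collections import defaultdict, deque


class HotCookieFilter:
    """Block cookies seen more than `max_hits` times in `window_s` seconds."""

    def __init__(self, max_hits=100, window_s=60.0):
        self.max_hits = max_hits  # may be reassigned at runtime to tune
        self.window_s = window_s
        self._seen = defaultdict(deque)  # cookie -> recent timestamps
        self.blocked = 0
        self.total = 0

    def allow(self, cookie, now):
        """Record a bid request at time `now`; return False if blocked."""
        hits = self._seen[cookie]
        while hits and now - hits[0] >= self.window_s:
            hits.popleft()  # drop timestamps that fell out of the window
        hits.append(now)
        self.total += 1
        if len(hits) > self.max_hits:
            self.blocked += 1
            return False
        return True

    def blocked_fraction(self):
        """Share of all requests blocked so far."""
        return self.blocked / self.total if self.total else 0.0


# Conservative-to-aggressive tuning is a matter of lowering max_hits.
bouncer = HotCookieFilter(max_hits=3, window_s=10.0)
results = [bouncer.allow("c1", t) for t in (0.0, 1.0, 2.0, 3.0, 20.0)]
```

Tracking `blocked_fraction` alongside the threshold is what makes it possible to see when tuning starts to block too much traffic.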

Page 98: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

We rolled CookieBouncer

out with conservative

settings and started

adjusting, keeping tabs

on the key indicators.

Anti-fraud CookieBouncer

Page 99: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Anti-fraud CookieBouncer

Page 100: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

We adjusted and were

very surprised at the total

number of blocked

cookies and the

percentage of total traffic.

Anti-fraud CookieBouncer

Page 101: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Anti-fraud CookieBouncer

Page 102: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Anti-fraud CookieBouncer

Page 103: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

The instrumentation

speaks for itself.

Anti-fraud CookieBouncer

Page 104: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Anti-fraud CookieBouncer

Page 105: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Learn more at:

DVO204 - Monitoring Strategies: Finding Signal in the Noise

Thursday, Oct 8, 11:00 AM - 12:00 PM

OR

http://bit.ly/1Qo4Zmy

Page 106: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Thank you!

Page 107: (DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument

Remember to complete

your evaluations!