(DVO205) Monitoring Evolution: Flying Blind to Flying by Instrument


TRANSCRIPT

© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

DVO205

The AdRoll Monitoring Evolution: From Flying Blind to Flying by Instrument

Brian Troutwine, AdRoll

Ilan Rabinovitch, Datadog

October 2015

Today’s speakers

Ilan Rabinovitch, Dir. Technical Community, Datadog

Brian Troutwine, Sr. Software Engineer, AdRoll

Quick Overview of Datadog

• Monitoring for modern applications
• Dynamic infrastructure
• Microservices
• Time series storage of metrics and events
• 100s of built-in integrations (e.g. EC2, ELB, ECS, and more)

CAMS: Culture, Automation, Metrics, Sharing (Damon Edwards and John Willis)

CAMS: Culture, Automation, METRICS, SHARING

You’re in the cloud and it's everything you dreamed of! Autoscaling, container orchestration, infinite storage.

In cloud we trust.

But how do we verify health?

If it moves, monitor it.

How does our current monitoring fit in?

• Host-centric
• Static configurations tracking dynamic infrastructure
• Focused on resources, rather than work
• Difficult to pull together and compare data from multiple sources

So what to monitor?

More at: http://goo.gl/t1Rgcg

How to use that data?

More at: http://goo.gl/t1Rgcg

Recurse until you find root cause

Query-based monitoring

• Aggregates matter because the underlying infrastructure is dynamic
• Express our monitors or alerts as queries on predicates (see the sketch below):
  • “avg response time for requests to hosts running nginx > 500 ms”
  • “min # of hosts running nginx < 3”
• Mash up data sources for a 360-degree view of a problem

Query-based monitoring: “Show me iowait across nginx hosts, grouped by availability zone”
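As a concrete illustration of a query-defined monitor, here is a minimal sketch using the `datadog` Python client's Monitor API. The metric name, tags, and threshold are assumptions chosen to mirror the example predicates above, not AdRoll's actual monitors, and tag names depend on your own tagging scheme.

```python
# Minimal sketch of a query-defined monitor via the `datadog` Python client.
# The metric name, tag, and threshold are illustrative assumptions.
from datadog import initialize, api

initialize(api_key="YOUR_API_KEY", app_key="YOUR_APP_KEY")

# "avg response time for requests to hosts running nginx > 500 ms"
api.Monitor.create(
    type="metric alert",
    query="avg(last_5m):avg:nginx.request_time{role:nginx} > 0.5",
    name="nginx average response time above 500 ms",
    message="Average response time for nginx hosts exceeded 500 ms.",
)

# The "iowait across nginx hosts, grouped by availability zone" view is the
# same idea on a dashboard; the underlying query is roughly:
#   avg:system.cpu.iowait{role:nginx} by {availability-zone}
```

Because the query aggregates over whatever hosts currently carry the `role:nginx` tag, the monitor keeps working as instances come and go.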

Real-time bidding

The problem domain
• Low latency (< 100 ms per transaction)
• Firm real-time system
• Highly concurrent (~2 million transactions per second, peak)
• Global, 24/7 operation

In the early days of the AdRoll real-time bidding (RTB) project, we could use our intuition.

• The system was simple.
• The number of total requests was small.
• The impact of mistakes was minor.

We could be reasonably confident that our mental model of the system’s behavior was accurate.

The trouble with a complex system is that its behavior in practice gets away from you pretty fast.

Our first approach was to batch process logs generated by individual bidders.

Batch processing

Pros:
• We were already doing this.
• It’s simple to implement.
• It’s straightforward to conceive.

Cons:
• High update latency
• Catastrophic errors lose logs
• Rules out impulse experimentation

Our second approach was to generate coarse real-time metrics and analyze those.

Coarse real-time metrics

Pros:
• Iterative step up from batch processing
• Proves out the concept
• Simple to implement

Cons:
• Still relied on intuition
• Bidder implementation was sub-optimal
• Dashboards took a one-size-fits-all approach

By this point, the complexity of the system and our ambitions were growing.

• Two engineers were added to the team.
• Tens more in the department.
• RTB became a central project.

We were making decisions in a knowledge void.

At this point, we have AWS CloudWatch. CloudWatch reports the basic health of your system and provides the total view of the AWS services you’re using.

What we don’t have at this point is a detailed view of our system, or the ability to explore the information we have, especially in high-stress situations.

Exometer solves our Erlang-side problem. Detailed application-level instrumentation is cheap and easy.

Datadog solves our aggregation, visualization, and alerting problems.
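AdRoll's bidders are Erlang and use Exometer for this; purely as an illustration of how cheap application-level instrumentation can be, here is an analogous Python sketch that emits the same kinds of metrics through Datadog's DogStatsD client. The metric names, tags, and the `compute_bid` helper are hypothetical.

```python
# Illustrative only: the bidders in the talk use Exometer (Erlang).
# This sketch shows the same kind of app-level instrumentation with DogStatsD.
import time
from datadog import initialize, statsd

initialize(statsd_host="127.0.0.1", statsd_port=8125)

def handle_bid_request(request):
    start = time.monotonic()
    statsd.increment("bidder.requests", tags=["exchange:" + request.exchange])
    try:
        bid = compute_bid(request)  # hypothetical bidder logic
        statsd.increment("bidder.bids_submitted")
        return bid
    finally:
        elapsed_ms = (time.monotonic() - start) * 1000.0
        # A histogram yields percentiles in Datadog, not just averages,
        # which is what you want when working against a 100 ms bid deadline.
        statsd.histogram("bidder.response_time_ms", elapsed_ms)
```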

Datadog integrates with CloudWatch, so our system-specific metrics can be correlated with the basic health of the system.

Correlation of system information
• This can be done in real time.
• This can be done in high-stress situations.
• This can be done by other departments of the business.

Timeout spikes

A bid “times out” when we don’t reply back to the exchange within 100 ms. We didn’t realize this was happening. It’s an early win of our sophisticated monitoring.

System load is normal, and there’s not a periodic spike in bid request traffic. There is a correlated jump in network traffic, however. There were also correlated spikes in the Erlang VM’s process run queue.

The root cause:
• VM scheduler threads are locked to CPUs.
• A CPU-intensive background process kicks in every 20 minutes.
• There is no CPU shield on the server (see the sketch below).
• A VM scheduler thread gets kicked from its assigned CPU, and processes back up behind it.
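The slides don't show the remediation, but the "no CPU shield" bullet suggests the shape of one: keep the periodic background job off the cores the Erlang scheduler threads are pinned to. A hypothetical sketch follows; the job path and reserved core are made up.

```python
# Hypothetical mitigation sketch, not the fix described in the talk:
# confine a CPU-hungry background job to a reserved core so it cannot
# evict the Erlang VM scheduler threads pinned to the other cores.
import os
import subprocess

RESERVED_CPU = 0  # assumption: core 0 is kept away from the bidder VM

def run_shielded(cmd):
    # preexec_fn runs in the child just before exec, restricting its affinity.
    return subprocess.run(
        cmd,
        preexec_fn=lambda: os.sched_setaffinity(0, {RESERVED_CPU}),
        check=True,
    )

run_shielded(["/usr/local/bin/periodic_maintenance"])  # hypothetical job
```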

Traffic crash

Failure of bid-request traffic is an all-hands problem. Without traffic, the bidders can do nothing.

That’s a healthy couple of days' worth of traffic: it dips in the night and climbs in the day. This is a weekend’s worth of traffic lost.

• Confirmed with CloudWatch that networking to the machines was fine.
• No changes had been made to the production system (it was a looser time).
• All detail metrics from the Erlang VM were acceptable.

The exchange confirmed a drop in traffic from their system. Turns out, we hit an implicit exchange limitation. We also became more conscientious about alerting effectively.

Sophisticated autoscaling

At high scale, it’s very easy to end up over-provisioned for the system’s load. Worse, it’s just as easy to end up under-provisioned.

All CloudWatch alarms on EC2 instances can be pressed into service for autoscaling. Our first autoscaling approach used remaining idle CPU.

As traffic drops off at the end of the day, we need less CPU time to process it. This was great! We immediately saved loads of money.

The problem was, it’s an indirect measurement. There’s always some nuance you’ll miss. Co-resident subsystems eat into the CPU time, giving an inaccurate impression, and CPU consumption carries no information about aberrant system issues.

What can be done? Distill the performance capability of your system into a single signal.

The "metadata index" tracks the load on the bidders. It’s a weighted sum of key metrics.

As traffic drops, the metadata index drops; indirectly, idle CPU increases. We emit this metadata index into CloudWatch as a custom metric.
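The talk doesn't disclose the formula or the weights, but the mechanics are simple: compute a weighted sum of key bidder metrics and push it to CloudWatch as a custom metric. A hedged sketch with made-up metric names, weights, and namespace, using boto3's `put_metric_data`:

```python
# Sketch of a "metadata index": a weighted sum of key bidder metrics,
# published to CloudWatch as a custom metric. Weights, metric names, and
# the namespace are illustrative assumptions, not AdRoll's actual values.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

WEIGHTS = {
    "bid_requests_per_sec": 0.5,
    "avg_response_time_ms": 0.3,
    "run_queue_length": 0.2,
}

def metadata_index(samples):
    """samples: dict of current readings for the metrics named in WEIGHTS."""
    return sum(WEIGHTS[name] * samples[name] for name in WEIGHTS)

def publish(index_value):
    cloudwatch.put_metric_data(
        Namespace="AdRoll/RTB",  # hypothetical namespace
        MetricData=[{
            "MetricName": "MetadataIndex",
            "Value": index_value,
            "Unit": "None",
        }],
    )

publish(metadata_index({
    "bid_requests_per_sec": 1800.0,
    "avg_response_time_ms": 42.0,
    "run_queue_length": 3.0,
}))
```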

As soon as it hits CloudWatch, you can autoscale on it.
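Once the custom metric exists, a CloudWatch alarm on it can drive an Auto Scaling policy just like any built-in metric. A minimal boto3 sketch; the alarm name, threshold, and scaling-policy ARN are placeholders.

```python
# Sketch: scale out when the custom MetadataIndex stays high. The threshold,
# alarm name, and policy ARN are placeholders, not AdRoll's configuration.
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="rtb-metadata-index-high",
    Namespace="AdRoll/RTB",
    MetricName="MetadataIndex",
    Statistic="Average",
    Period=60,
    EvaluationPeriods=3,
    Threshold=1000.0,
    ComparisonOperator="GreaterThanThreshold",
    # ARN of a scale-out policy attached to the bidder Auto Scaling group.
    AlarmActions=["arn:aws:autoscaling:...:scalingPolicy:..."],
)
```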

This is twice as efficient as the CPU idle scaling signal: one-half the number of machines.

Anti-fraud CookieBouncer

There’s a lot of fraud in the online advertising industry. A certain kind of “hot cookie” fraud caused a tolerable fault in the bidders.

CookieBouncer blocks bidding on fraudulent, “hot,” cookies in real time.
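The talk doesn't describe CookieBouncer's internals; as a generic illustration of the idea, real-time blocking of "hot" cookies usually amounts to a per-cookie rate counter over a sliding window with a tunable threshold. A rough sketch, where the window size, threshold, and `compute_bid` helper are invented:

```python
# Generic sketch of hot-cookie blocking; not CookieBouncer's actual algorithm.
# A cookie seen more than THRESHOLD times inside WINDOW_SECONDS is "hot",
# and bids for it are skipped. Both knobs are meant to be tunable at runtime.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60.0   # illustrative
THRESHOLD = 100         # illustrative

_seen = defaultdict(deque)  # cookie id -> recent request timestamps

def is_hot(cookie_id, now=None):
    now = time.monotonic() if now is None else now
    timestamps = _seen[cookie_id]
    timestamps.append(now)
    while timestamps and now - timestamps[0] > WINDOW_SECONDS:
        timestamps.popleft()
    return len(timestamps) > THRESHOLD

def maybe_bid(request):
    if is_hot(request.cookie_id):
        return None  # skip fraudulent traffic instead of bidding on it
    return compute_bid(request)  # hypothetical bidder logic
```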

Our concern was blocking too much traffic, in turn blocking legitimate bids through over-aggressive tuning. We built a new CookieBouncer dashboard and introduced the ability to tune it in real time on every bidder.

We rolled CookieBouncer out with conservative settings and started adjusting, keeping tabs on the key indicators. We adjusted and were very surprised at the total number of blocked cookies and the percentage of total traffic.

The instrumentation speaks for itself.

Learn more at DVO204 - Monitoring Strategies: Finding Signal in the Noise, Thursday, Oct 8, 11:00 AM - 12:00 PM, or http://bit.ly/1Qo4Zmy

Thank you! Remember to complete your evaluations!
