WebCamp: Back-End Developers Day. Aleksandr Krutko, "Metrics-Driven...

Upload: geekslab

Post on 14-Aug-2015

Data is growing

90% of data was created in the last two years

800% data growth is projected by 2017

Only 0.5% of data is being put to use

Metrics Driven Development

from .io by Aleksandr and Vlad

Genesis: 50M monthly audience, 10 projects, different platforms, emerging markets, no experience

.io: 4 technical people, multiple products, hundreds of servers, hundreds of millions of requests, terabytes of data

6 months: 900% growth

Horror: Bugs, downtime, wrong implementations, performance issues, no time at all, angry people

Challenge

How do we do it? try -> learn -> retry -> result

Faster = better: 20 deployments and 200 new errors a day

We want to know: Is everything OK? What isn't OK? Why did that happen? And how do we fix it? Right now.

Real time: It isn't about fast queries and in-memory databases. It's about having a chance to affect the outcome.

Metrics

System: Network, disks, CPUs…

Munin, Zabbix

Application: API calls, queue size, execution times, errors…

New Relic, Datadog, Graphite

Business: Clicks, signups, returns, actions…

statsd + t
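The tracking snippets in this deck call increment() and timing() without defining them. A minimal sketch of what such helpers might look like, assuming they wrap the StatsD line format ("name:value|type") sent over UDP to a daemon on 127.0.0.1:8125 (the StatsD default port). The helper names statsd_format() and statsd_send(), the host and the timeout are assumptions for illustration, not the speakers' implementation.

```php
<?php
/** Build one StatsD datagram, e.g. "notification.sent:1|c". */
function statsd_format(string $metric, $value, string $type): string
{
    return sprintf('%s:%s|%s', $metric, $value, $type);
}

/** Fire-and-forget UDP send; tracking should never break the application. */
function statsd_send(string $payload, string $host = '127.0.0.1', int $port = 8125): void
{
    // Assumed endpoint: a StatsD daemon on localhost, default port 8125.
    $socket = @fsockopen('udp://' . $host, $port, $errno, $errstr, 0.1);
    if ($socket === false) {
        return; // drop the metric rather than fail the request
    }
    @fwrite($socket, $payload);
    fclose($socket);
}

/** Counter ("c"): something happened $by times. */
function increment(string $metric, int $by = 1): void
{
    statsd_send(statsd_format($metric, $by, 'c'));
}

/** Timer ("ms"): the value is passed through as given. */
function timing(string $metric, $value): void
{
    statsd_send(statsd_format($metric, $value, 'ms'));
}
```

UDP is used here because a lost metric costs less than a slow or failed request. Note that StatsD timers conventionally carry milliseconds, so a caller measuring seconds may want to convert first.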

Tracking

Events: Anything that happens

foreach ($users as $id) {
    send_notification($id);
    increment('notification.sent'); // one count per notification sent
}

Values: Anything that changes

user_insert($user_data);
increment('moderation.queue'); // the moderation queue grew by one

Errors: Can't escape them, but can react quickly

set_error_handler(function () {
    increment("app.fuckups"); // count every PHP error as it is raised
    return false;             // let PHP's standard error handler run as well
});

Debug: Track events that shouldn't happen

increment('data.process.start');
if (prepare($data)) increment('data.process.prepared');
if (send($data)) increment('data.process.sent');
// a gap between the start, prepared and sent counters shows which step fails

Implementation: QA can't guarantee correctness in the real world

$am_i_online = is_online($_SESSION['user_id']);
if (!$am_i_online) {
    // the user making this request is reported offline, which should be impossible
    increment('bugs.online_paradox');
}

Hard questions: How many users do we lose because of slow moderation?

function on_moderate(array $photo) {
    $waited = time() - $photo['time']; // seconds the photo spent in the queue
    timing("moderation.wait", $waited);
    if ($waited > 300) { // waited longer than 5 minutes
        increment('user.lost');
    }
}

Tools

Daily dynamics: What is happening today?

Minute dynamics: What is happening right now?

Alerts: Email me if anything happens to critical metrics

Alert me if "*error*" is more than 0
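A rule like the one above can be sketched as a wildcard match on metric names plus a threshold check. The function name triggered_alerts(), the sample metric names and the echo-based delivery are hypothetical; the slides do not show how the alerting tool is built.

```php
<?php
/**
 * Return the metrics that trip a rule such as: alert if "*error*" is more than 0.
 * $metrics maps metric names to their current values.
 */
function triggered_alerts(array $metrics, string $pattern, float $threshold = 0): array
{
    $hits = [];
    foreach ($metrics as $name => $value) {
        // fnmatch() does shell-style wildcard matching on the metric name
        if (fnmatch($pattern, (string) $name) && $value > $threshold) {
            $hits[$name] = $value;
        }
    }
    return $hits;
}

// Made-up sample values for illustration only
$current = [
    'api.error.timeout' => 3,
    'db.errors'         => 0,
    'notification.sent' => 1200,
];

foreach (triggered_alerts($current, '*error*') as $name => $value) {
    // Hypothetical delivery step: swap in mail() or any other notifier
    echo "ALERT: {$name} = {$value}\n";
}
```

Only api.error.timeout fires here: db.errors matches the pattern but is not above 0, and notification.sent does not match at all.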

Dashboards: Create frequently, delete frequently

Evolve: Callbacks based on metric value

If the waiting log is more than 5 GB, send a callback to http://dev.onthe.io/sys?add_bulk=1
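One way to picture such a callback rule is a threshold check that requests a URL once the metric crosses the limit. This is a sketch under assumptions: fire_callback_if_over() and its injected $fetch parameter are hypothetical, and the example does a dry run that prints the URL instead of requesting it.

```php
<?php
/**
 * Call $callbackUrl when $value exceeds $limit; return whether it fired.
 * $fetch is injectable so the rule can be tried without network access;
 * by default it issues a plain GET with file_get_contents().
 */
function fire_callback_if_over(float $value, float $limit, string $callbackUrl, ?callable $fetch = null): bool
{
    if ($value <= $limit) {
        return false;
    }
    $fetch = $fetch ?? static function (string $url) {
        return @file_get_contents($url);
    };
    $fetch($callbackUrl);
    return true;
}

$fiveGb  = 5 * 1024 ** 3; // 5 GB expressed in bytes
$logSize = 6 * 1024 ** 3; // pretend the waiting log has grown to 6 GB

// Dry run: report the URL that would be called
$fired = fire_callback_if_over(
    $logSize,
    $fiveGb,
    'http://dev.onthe.io/sys?add_bulk=1',
    function (string $url) {
        echo "would call {$url}\n";
    }
);
```

A real system would also need to avoid firing the same callback on every check while the metric stays above the limit; that is left out here.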

Advanced: Slices, anomalies, correlations

Slices: Slice each metric by specific criteria (visits by region, team performance by member, etc.)
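With a counter-based tracker, one simple way to get slices is to record the same event under several names. The sketch below assumes a "metric.dimension.value" naming scheme; both the scheme and the function sliced_names() are illustrative assumptions, not something the slides specify.

```php
<?php
/**
 * Expand one metric into the unsliced total plus one name per slice.
 * $slices maps a dimension (e.g. "region") to its value (e.g. "ua").
 */
function sliced_names(string $metric, array $slices): array
{
    $names = [$metric]; // always keep the unsliced total
    foreach ($slices as $dimension => $value) {
        $names[] = "{$metric}.{$dimension}.{$value}";
    }
    return $names;
}

// Hypothetical visit from a given region and platform
print_r(sliced_names('visits', ['region' => 'ua', 'platform' => 'ios']));
```

Each returned name would then be passed to the counter helper, so the total and every slice grow together.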

Anomalies: Detect unusual changes
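The slides do not say how anomalies are detected. As one common baseline, a plain z-score test flags a value that lies too many standard deviations from the recent mean. The function is_anomaly(), the 3-sigma default and the sample series are assumptions for illustration.

```php
<?php
/**
 * Flag $latest as anomalous when it lies more than $maxSigma standard
 * deviations from the mean of $history (a plain z-score test).
 */
function is_anomaly(array $history, float $latest, float $maxSigma = 3.0): bool
{
    $n = count($history);
    if ($n < 2) {
        return false; // not enough data to judge
    }
    $mean = array_sum($history) / $n;
    $sumSquares = 0.0;
    foreach ($history as $v) {
        $sumSquares += ($v - $mean) ** 2;
    }
    $std = sqrt($sumSquares / $n); // population standard deviation
    if ($std == 0.0) {
        return $latest != $mean; // flat history: any change is unusual
    }
    return abs($latest - $mean) / $std > $maxSigma;
}

// Made-up per-minute signup counts
$signupsPerMinute = [10, 12, 11, 9, 10, 11, 12, 10];
var_dump(is_anomaly($signupsPerMinute, 11)); // within the usual range
var_dump(is_anomaly($signupsPerMinute, 40)); // far outside the usual range
```

A z-score assumes roughly stable traffic; metrics with strong daily cycles would need a baseline that accounts for time of day.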

Correlations: Find correlated metrics
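How the tool finds correlated metrics is not described in the slides. A minimal sketch, assuming the Pearson correlation coefficient over two series sampled at the same moments; the function pearson() and the sample data are hypothetical.

```php
<?php
/**
 * Pearson correlation coefficient of two equally long series.
 * Returns null when it is undefined (too few points, unequal lengths,
 * or a series with no variation).
 */
function pearson(array $x, array $y): ?float
{
    $x = array_values($x);
    $y = array_values($y);
    $n = count($x);
    if ($n < 2 || $n !== count($y)) {
        return null;
    }
    $meanX = array_sum($x) / $n;
    $meanY = array_sum($y) / $n;
    $cov = 0.0;
    $varX = 0.0;
    $varY = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $dx = $x[$i] - $meanX;
        $dy = $y[$i] - $meanY;
        $cov  += $dx * $dy;
        $varX += $dx * $dx;
        $varY += $dy * $dy;
    }
    if ($varX == 0.0 || $varY == 0.0) {
        return null;
    }
    return $cov / sqrt($varX * $varY);
}

// Made-up samples: error count and response time move together
$errors     = [1, 2, 3, 4, 5];
$responseMs = [110, 120, 130, 140, 150];
var_dump(pearson($errors, $responseMs));
```

Values near 1 or -1 suggest the two metrics move together or in opposition; a high coefficient is a hint about where to look, not proof of a cause.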

Culture: Implement -> track -> check