90% of data was created in the last two years
800% data growth is projected by 2017
Only 0.5% of data is being put to use
Metrics Driven Development
from .io, by Aleksandr and Vlad
Genesis: 50m monthly audience, 10 projects, different platforms,
emerging markets, no experience
.io: 4 technical people, multiple products, hundreds of
servers, hundreds of millions of requests, terabytes of data
6 months: 900% growth
Horror: Bugs, downtimes, wrong implementations, performance
issues, no time at all, angry people
How do we do it? try -> learn -> retry -> result
Faster = better: 20 deployments and 200 new errors a day
We want to know: Is everything ok? What isn’t ok? Why did it happen?
And how do we fix it? Right now.
Real Time: It isn’t about fast queries and in-memory databases.
It’s about having a chance to affect the outcome.
System: Network, Disks, CPUs…
Munin, Zabbix
Application: API calls, queue size, execution times, errors…
New Relic, Datadog, Graphite
Business: Clicks, signups, returns, actions…
statsd + t
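The snippets in this deck call increment() and timing() without showing them. A minimal sketch of what such helpers could look like, assuming a StatsD daemon on UDP 127.0.0.1:8125 (the StatsD default port). The helper names statsd_format() and statsd_send() are made up for illustration and are not from the deck:

```php
<?php
// Hypothetical helpers: the deck uses increment() and timing() but never defines them.
// Assumes a StatsD daemon listening on UDP 127.0.0.1:8125.

// Build one StatsD datagram: "<name>:<value>|<type>", e.g. "notification.sent:1|c".
function statsd_format(string $name, $value, string $type): string
{
    return sprintf('%s:%s|%s', $name, $value, $type);
}

// Fire-and-forget over UDP; failures are swallowed so metrics never break the app.
function statsd_send(string $payload, string $host = '127.0.0.1', int $port = 8125): void
{
    $socket = @fsockopen('udp://' . $host, $port, $errno, $errstr, 0.1);
    if ($socket === false) {
        return;
    }
    @fwrite($socket, $payload);
    fclose($socket);
}

// Counter: "something happened" (StatsD type "c").
function increment(string $name, int $by = 1): void
{
    statsd_send(statsd_format($name, $by, 'c'));
}

// Timer: "something took this long" (StatsD type "ms").
function timing(string $name, float $duration): void
{
    statsd_send(statsd_format($name, $duration, 'ms'));
}
```

StatsD timers are conventionally reported in milliseconds; a caller that measures seconds may want to multiply by 1000 before calling timing().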
Events: Anything that happens
foreach ( $users as $id ) {
    send_notification($id);
    increment('notification.sent'); // one event per notification sent
}
Values: Anything that changes
user_insert($user_data);
increment('moderation.queue'); // the moderation queue just grew by one
Errors: Can’t escape them, but can react quickly
set_error_handler(function() {
    increment("app.fuckups");
    return false; // let PHP's standard error handling continue
});
Debug: Track events that shouldn’t happen
increment('data.process.start');
if ( prepare($data) ) increment('data.process.prepared');
if ( send($data) ) increment('data.process.sent');
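The three counters above form a small funnel: whatever started but was never prepared or sent is the "event that shouldn't happen". A sketch of how the drop-off could be read, assuming the counter totals for some time window are already available as an array. The function funnel_dropoff() and the shape of $counts are hypothetical:

```php
<?php
// Hypothetical: $counts holds counter totals for one time window,
// however the metrics backend exposes them.
function funnel_dropoff(array $counts, array $stages): array
{
    $lost = [];
    for ($i = 1; $i < count($stages); $i++) {
        $from = $stages[$i - 1];
        $to   = $stages[$i];
        // Items that reached $from but never reached $to.
        $lost[$from . ' -> ' . $to] = ($counts[$from] ?? 0) - ($counts[$to] ?? 0);
    }
    return $lost;
}
```

A non-zero difference between two neighbouring stages points at the step where data is being dropped.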
Implementation: QA can’t assure correctness in the real world
$am_i_online = is_online($_SESSION['user_id']);
if ( !$am_i_online ) increment('bugs.online_paradox'); // the current user must be online
Hard questions: How many users do we lose because of slow moderation?
function on_moderate($photo) {
    $waited = time() - $photo['time']; // seconds the photo waited for moderation
    timing("moderation.wait", $waited);
    if ( $waited > 300 ) increment('user.lost'); // waited more than 5 minutes
}
Daily dynamics: What is happening today?
Minute dynamics: What is happening right now?
Alerts: Email me if anything happens with critical metrics
Alert me if "*error*" is more than 0
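The deck shows only the rule text, not how it is evaluated. One way such a rule could be checked, as a sketch: match metric names against the wildcard pattern and compare each value with the threshold. The function triggered_alerts() and the shape of $metrics are assumptions, not the product's API:

```php
<?php
// Hypothetical rule evaluator.
// $metrics maps metric name => current value for the window being checked.
function triggered_alerts(array $metrics, string $pattern, float $threshold): array
{
    $hits = [];
    foreach ($metrics as $name => $value) {
        // fnmatch() does shell-style wildcard matching,
        // so "*error*" matches "api.errors.500".
        if (fnmatch($pattern, (string) $name) && $value > $threshold) {
            $hits[$name] = $value;
        }
    }
    return $hits;
}
```

A non-empty result is what would go into the alert email, for example via PHP's mail().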
Dashboards: Create frequently, delete frequently
Evolve: Callbacks based on metric value
If the waiting log is more than 5 GB, send a callback to http://dev.onthe.io/sys?add_bulk=1
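A sketch of how a callback rule like this could be run, assuming the metric value is already known and the callback is a plain HTTP GET. The names run_callback_rule(), http_get_callback() and FIVE_GB are hypothetical; the sender is passed in so it can be swapped or stubbed:

```php
<?php
// Hypothetical rule runner: calls $send($url) only when the metric exceeds the threshold.
function run_callback_rule(float $value, float $threshold, string $url, callable $send): bool
{
    if ($value <= $threshold) {
        return false; // metric is within limits, nothing to do
    }
    $send($url);
    return true;
}

// A possible default sender: a plain GET. Assumes allow_url_fopen is enabled.
function http_get_callback(string $url): void
{
    @file_get_contents($url);
}

const FIVE_GB = 5 * 1024 * 1024 * 1024; // the deck's "5Gb", read here as bytes
```

With the rule from the slide this would be called as run_callback_rule($waiting_log_bytes, FIVE_GB, 'http://dev.onthe.io/sys?add_bulk=1', 'http_get_callback'), where $waiting_log_bytes is the current metric value.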
Advanced: Slices, anomalies, correlations
Slices: Slice each metric by specific criteria
(visits by region, team performance by member, etc.)
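With a StatsD-style backend, one simple way to slice is to encode the criteria in the metric name. A sketch under that assumption; the naming scheme and the function sliced_metric() are hypothetical, and the deck does not say how slices are actually stored:

```php
<?php
// Hypothetical naming scheme: append each slice as ".key.value" to the base metric,
// so "visits" sliced by region "EU" becomes "visits.region.eu".
function sliced_metric(string $base, array $slices): string
{
    ksort($slices); // fixed order so the same slices always give the same name
    foreach ($slices as $key => $value) {
        // Keep names safe: anything outside [a-z0-9_] becomes "_".
        $clean = preg_replace('/[^a-z0-9_]+/', '_', strtolower((string) $value));
        $base .= '.' . $key . '.' . $clean;
    }
    return $base;
}
```

The resulting name can be passed to a counter call in place of the plain metric name.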
Anomalies: Detect unusual changes
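The deck does not say which detection method is used. As an illustration only, a simple z-score check: the latest value counts as unusual when it sits more than a few standard deviations away from the recent mean. The function is_anomaly() and the default threshold of 3 are assumptions:

```php
<?php
// Flags $latest as anomalous when it is more than $z sample standard deviations
// away from the mean of $history. A z-score test is an assumption here.
function is_anomaly(array $history, float $latest, float $z = 3.0): bool
{
    $n = count($history);
    if ($n < 2) {
        return false; // not enough data to judge
    }
    $mean = array_sum($history) / $n;
    $sumSquares = 0.0;
    foreach ($history as $v) {
        $sumSquares += ($v - $mean) ** 2;
    }
    $std = sqrt($sumSquares / ($n - 1)); // sample standard deviation
    if ($std == 0.0) {
        return $latest != $mean; // flat history: any change is unusual
    }
    return abs($latest - $mean) / $std > $z;
}
```

This ignores seasonality (daily or weekly cycles), so it fits best when $history covers comparable time windows.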
Correlations: Find correlated metrics
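One common way to check whether two metrics move together is the Pearson correlation coefficient over equally spaced samples. The deck does not name a method, so this is a sketch of that standard formula; pearson() is a hypothetical helper:

```php
<?php
// Pearson correlation of two equally long series; returns a value in [-1, 1],
// or null when it is undefined (too few points, unequal lengths, constant series).
function pearson(array $x, array $y): ?float
{
    $n = count($x);
    if ($n < 2 || $n !== count($y)) {
        return null;
    }
    $x = array_values($x);
    $y = array_values($y);
    $mx = array_sum($x) / $n;
    $my = array_sum($y) / $n;
    $cov = $vx = $vy = 0.0;
    for ($i = 0; $i < $n; $i++) {
        $dx = $x[$i] - $mx;
        $dy = $y[$i] - $my;
        $cov += $dx * $dy;
        $vx  += $dx * $dx;
        $vy  += $dy * $dy;
    }
    if ($vx == 0.0 || $vy == 0.0) {
        return null; // a constant series has no defined correlation
    }
    return $cov / sqrt($vx * $vy);
}
```

Values near 1 or -1 suggest the two metrics rise and fall together (or in opposition); values near 0 suggest no linear relationship. Correlation alone does not show which metric drives the other.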
Culture: Implement -> track -> check