metrics stack 2.0
DESCRIPTION
Most metrics systems link a timeseries to a string key; some add a few tags. These keys often lack information, use inconsistent formats and terminology, and are poorly organized. As the number of people and the amount of software generating, processing, storing, and visualizing metrics grows, this approach becomes very cumbersome, and there is a lot to be gained from taking a step back and rethinking metric identifiers and metadata.
Metrics 2.0 is a set of conventions around metrics: with barely any extra work, metrics become self-describing and standardized. Compatibility between tools increases dramatically, dashboards can automatically convert information needs into graphs, graph renderers can present data more usefully, and anomaly detection and aggregators can work more autonomously and avoid common mistakes. The result: less micromanaging of software and configuration, quicker results, more clarity, less frustration, and less room for error.
This talk also covers the tools that turn this concept into production-ready reality. Graph-Explorer is an application that integrates with Graphite: enter an expression that represents an information need, and it generates the corresponding graphs or alerting rules, automatically applying unit conversion, aggregation, processing, and so on. Statsdaemon is an aggregation daemon, like Etsy's Statsd, that expresses the aggregations and statistical operations it performs by updating the metrics' tags, making sure that the metric metadata always corresponds to the data.
Dieter Plaetinck is a systems-gone-backend engineer at Vimeo.
TRANSCRIPT
Credit: user niteroi @ panoramio.com
vimeo.com/43800150
1 Metrics 2.0 concepts
2 Implementation
3 Advanced stuff
“Dieter” ?
“Peter”? “Deter”? →
Terminology sync
(1234567890, 82)
(1234567900, 123)
(1234567910, 109)
(1234567920, 77)
db15.mysql.queries_running
host=db15 mysql.queries_running
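The move from a dotted key to tags can be sketched in a few lines. This is a hypothetical illustration, not a real library: the field names (`host`, `service`, `what`) are assumptions about one sensible split, not a fixed schema.

```python
# Hypothetical sketch: turning a dotted Graphite key into a tag dictionary.
# The tag names used here (host, service, what) are assumptions.
def to_tags(dotted_key):
    """Split 'db15.mysql.queries_running' into self-describing tags."""
    host, service, what = dotted_key.split(".", 2)
    return {"host": host, "service": service, "what": what}

tags = to_tags("db15.mysql.queries_running")
# tags == {"host": "db15", "service": "mysql", "what": "queries_running"}
```

The point is that each value now lives in a named, orthogonal dimension instead of at a position in a string.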
How many pagerequests/s is vimeo.com doing?
● stats.hits.vimeo_com
● stats_counts.hits.vimeo_com
stats.<host>.requesthostport.vimeo_com_443
stats.timers.dfs5.proxyserver.object.GET.200.timing.upper_90
O(X*Y*Z)
X = # apps
Y = # people
Z = # aggregators
How long does it take to retrieve an object from swift?
stats.timers.<host>.proxyserver.<swift_type>.<http_method>.<http_code>.timing.<stat>
stats.timers.<host>.objectserver.<http_method>.timing.<stat>
target=stats.timers.dfs*.object*GET*timing.mean ?
target=groupByNode(stats.timers.dfs*.proxyserver.object.GET.*.timing.mean,2,"avg")
target=stats.timers.dfs*.objectserver.GET.timing.mean
swift_type=object stat=mean timing GET avg by http_code
O((D*V)^2)
D = # dimensions
V = # values per dim
collectd.db.disk.sda1.disk_time.write
What should I name my metric?
101001000
100001000001000000
Metrics 2.0
Old:
● information lacking
● fields unclear & inconsistent
● cumbersome strings / trees
● forbidden characters
New:
● Self-describing
● Standardized
● all dimensions in orthogonal tagspace
● Allow some useful characters
stats.timers.dfs5.proxyserver.object.GET.200.timing.upper_90
{
  “server”: “dfvimeodfsproxy5”,
  “http_method”: “GET”,
  “http_code”: “200”,
  “unit”: “ms”,
  “target_type”: “gauge”,
  “stat”: “upper_90”,
  “swift_type”: “object”,
  “plugin”: “swift_proxy_server”
}
Main advantages:
● Immediate understanding of metric meaning (ideally)
● Minimize time to graphs, dashboards, alerting rules
github.com/vimeo/graph-explorer/wiki
SI + IEC
B Err Warn Conn Job File Req ...
MB/s Err/d Req/h ...
{
“site”: “vimeo.com”,
“port”: 80,
“unit”: “Req/s”,
“direction”: “in”,
“service”: “webapp_php”,
“server”: “webxx”
}
Carbontagger:
... service=foo.instance=host.target_type=gauge.type=calculation.unit=B 123 1234567890
…
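Parsing such a line is straightforward. The sketch below is an assumption about the on-the-wire shape shown above (dot-joined `key=value` pairs, then value and timestamp); it is not carbontagger's actual code, and it ignores edge cases such as dots inside tag values.

```python
# Sketch: parse a metrics-2.0-style line as a tool like carbontagger might.
# Assumed format: dot-joined key=value pairs, a value, and a Unix timestamp.
def parse_line(line):
    name, value, timestamp = line.split()
    tags = dict(pair.split("=", 1) for pair in name.split("."))
    return tags, float(value), int(timestamp)

tags, value, ts = parse_line(
    "service=foo.instance=host.target_type=gauge.type=calculation.unit=B 123 1234567890")
# tags["unit"] == "B", value == 123.0, ts == 1234567890
```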
Statsdaemon:
..unit=B ..unit=B ... → unit=B/s
..unit=ms ..unit=ms .. → unit=ms stat=mean
→ unit=ms stat=upper_90
→ ...
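The key idea above is that when an aggregator derives a rate or a summary statistic, it must also rewrite the tags so the metadata keeps matching the data. A minimal sketch, with illustrative function names (this is not statsdaemon's actual API):

```python
# Sketch: tag updates that mirror the statistical operation performed.
# Names here are illustrative assumptions, not statsdaemon's real API.
def derive_rate(tags):
    out = dict(tags)
    out["unit"] = tags["unit"] + "/s"   # e.g. B -> B/s
    return out

def summarize(tags, stat):
    out = dict(tags)
    out["stat"] = stat                  # e.g. mean, upper_90
    return out

rate_tags = derive_rate({"unit": "B"})          # {"unit": "B/s"}
mean_tags = summarize({"unit": "ms"}, "mean")   # {"unit": "ms", "stat": "mean"}
```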
GraphExplorer queries 101
site:api.vimeo.com unit=Req/s
requesthostport api_vimeo_com
Smoothing
avg over 10M
avg over ...
Aggregation, compare port 80 vs 443
avg by <dimension>
sum by <dimension>
sum by server
Compare port 80 traffic among servers
site:api.vimeo.com unit=Req/s port=80 group by none avg over 10M
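Conceptually, `sum by <dimension>` collapses one tag dimension by summing all series that agree on every other tag. A sketch under assumed data layout (list of `(tags, value)` pairs); this is not Graph-Explorer's implementation:

```python
from collections import defaultdict

# Sketch of "sum by server": collapse one dimension by summing series
# that share all remaining tag values. Data layout is an assumption.
def sum_by(series, dim):
    """series: list of (tags, value); returns {remaining-tags-key: sum}."""
    buckets = defaultdict(float)
    for tags, value in series:
        key = tuple(sorted((k, v) for k, v in tags.items() if k != dim))
        buckets[key] += value
    return dict(buckets)

data = [
    ({"server": "web1", "port": "80"}, 10.0),
    ({"server": "web2", "port": "80"}, 15.0),
    ({"server": "web1", "port": "443"}, 5.0),
]
totals = sum_by(data, "server")
# {(("port", "80"),): 25.0, (("port", "443"),): 5.0}
```

`avg by` works the same way with a mean instead of a sum, and `group by` puts each bucket on its own graph rather than its own line.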
GraphExplorer queries 201
proxyserver swift server:regex upper_90 unit=ms from <datetime> to <datetime> avg over <timespec>
Compare object put/get
Stack .. http_method:(PUT|GET) swift_type=object avg by http_code,server
Comparing servers
http_method:(PUT|GET) avg by http_code,swift_type,http_method group by none
Compare http codes for GET, per swift type
http_method=GET avg by server group by swift_type
transcode unit=Job/s avg over <time> from <datetime> to <datetime>
Note: data is obfuscated
Bucketing
!queue sum by zone:apsoutheast|euwest|useast|uswest|saeast|vimeodf|vimeolv group by state
Note: data is obfuscated
Compare job states per region (zones bucket)
group by zone
Note: data is obfuscated
Unit conversion
unit=Mb/s network dfvimeorpc sum by server
unit=MB
{
server=dfvimeodfs1
plugin=diskspace
mountpoint=_srv_node_dfs5
unit=B
type=used
target_type=gauge
}
server:dfvimeodfs unit=GB type=free srv node
unit=GB/d group by mountpoint
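Because units are explicit tags, a tool can convert them mechanically. A minimal sketch of SI-prefix conversion (single-letter prefixes only, an assumption for brevity; it is not Graph-Explorer's actual converter and ignores IEC prefixes and rate suffixes):

```python
# Sketch: SI-prefix unit conversion, e.g. B -> GB, as a query tool might apply.
SI = {"": 1.0, "k": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}

def split_unit(unit):
    """'GB' -> ('G', 'B'); 'B' -> ('', 'B'). Single-letter prefixes only."""
    if len(unit) > 1 and unit[0] in SI:
        return unit[0], unit[1:]
    return "", unit

def convert(value, from_unit, to_unit):
    fp, fb = split_unit(from_unit)
    tp, tb = split_unit(to_unit)
    assert fb == tb, "incompatible base units"
    return value * SI[fp] / SI[tp]

convert(5e9, "B", "GB")  # 5.0
```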
Dashboard definition
queries = [
'cpu usage sum by core',
'mem unit=B !total group by type:swap',
'stack network unit=b/s',
'unit=B (free|used) group by =mountpoint'
]
stats.dfvimeocliapp2.twitter.error
{
“n1”: “dfvimeocliapp2”,
“n2”: “twitter”,
“n3”: “error”,
“plugin”: “catchall_statsd”,
“source”: “statsd”,
“target_type”: “rate”,
“unit”: “unknown/s”
}
Two hard things in computer science
stats.gauges.files.id_boundary_7day
stats.gauges.files.id_boundary_ceil
unit=File id_boundary_7d
{
“unit”: “File”,
“n1”: “id_boundary_7d”,
}
{
“intrinsic”: {
“site”: “vimeo.com”,
“unit”: “Req/s”
},
“extrinsic”: {
“agent”: “diamond”,
“processed_by”: “statsd1”,
“src”: “index.php:135”,
“replaces”: “vimeo_com_reqps”
}
}
site=vimeo.com unit=Req/s \
processed_by=statsd1 \
src=index.php:135 added_by=dieter \
123 1234567890
Equivalence
servers.host.cpu.total.iowait → “core”: “_sum_”
servers.host.cpu.<corenumber>.iowait
servers.host.loadavg.15
Rollups & aggregation
/etc/carbon/storage-aggregation.conf
[min]
pattern = \.min$
aggregationMethod = min
[max]
pattern = \.max$
aggregationMethod = max
[sum]
pattern = \.count$
aggregationMethod = sum
[default_average]
pattern = .*
aggregationMethod = average
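With self-describing metrics, the rollup method no longer needs regexes on name suffixes: it can be derived from the metric's own tags. The mapping below is an assumption about sensible defaults, not actual Graphite configuration or code:

```python
# Sketch: pick a rollup/aggregation method from metrics 2.0 tags
# instead of regexes on the metric name. Mapping is an assumption.
def rollup_method(tags):
    stat = tags.get("stat", "")
    if stat.startswith(("min", "lower")):
        return "min"
    if stat.startswith(("max", "upper")):
        return "max"
    if tags.get("target_type") == "count":
        return "sum"
    return "average"

rollup_method({"stat": "upper_90"})        # "max"
rollup_method({"target_type": "count"})    # "sum"
```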
2 kinds of graphite users
Self-describing metrics
stat=upper/lower/mean/...
target_type=counter ..
● stats.timers.render_time.histogram.bin_0.01
● stats.timers.render_time.histogram.bin_0.1
● stats.timers.render_time.histogram.bin_1 → unit=Freq_abs bin_upper=1
● stats.timers.render_time.histogram.bin_10
● stats.timers.render_time.histogram.bin_50
● stats.timers.render_time.histogram.bin_inf
● stats.timers.render_time.lower → unit=ms stat=lower
● stats.timers.render_time.mean → unit=ms stat=mean
● stats.timers.render_time.mean_90 → ...
● stats.timers.render_time.median
● stats.timers.render_time.std
● stats.timers.render_time.upper
● stats.timers.render_time.upper_90
Also..
● graphite API functions such as "cumulative", "summarize" and "smartSummarize"
● Graph renderers
From: dygraphs.com
Facet based suggestions
Metric types
● gauge
● count & rate
● counter
● timer
gauge
● Multiple values in same interval
● “sticky”
Count & Rate
Counter
Timer..
http://janabeck.com/blog/2012/10/12/lessonslearnedfrom100/
Timer..
● What should a metric be?
● Stickiness?
● Behavior when no packets received
● Behavior when multiple packets received
My personal takeaways
Conclusion
● Building graphs, setting up alerting is cumbersome
● Esp. changing information needs (troubleshooting, exploring, ..)
● Esp. complicated information needs
→ PAIN
● Structuring metrics
● Self-describing metrics
● Standardized metrics
● Native metrics 2.0
→ BREEZE
Conclusion
● Metrics can be so much more usable and useful. Let's talk about tagging, standardisation, retaining information throughout the pipeline.
● Converting information needs into graph defs, alerting rules
● Graph-Explorer, carbontagger, statsdaemon, …
● graphite-ng (native metrics 2.0)
● Metrics 2.0 in your apps, agents, aggregators?
● Build out structured metrics library
github.com/vimeo
github.com/Dieterbe
twitter.com/Dieter_be
dieter.plaetinck.be