metrics stack 2.0
DESCRIPTION
Most metrics systems link a timeseries to a string key; some add a few tags. These keys often lack information, use inconsistent formats and terminology, and are poorly organized. As the number of people and the amount of software generating, processing, storing, and visualizing metrics grows, this approach becomes very cumbersome, and there is a lot to be gained from taking a step back and rethinking metric identifiers and metadata.
Metrics 2.0 is a set of conventions around metrics: with barely any extra work, metrics become self-describing and standardized. Compatibility between tools increases dramatically, dashboards can automatically convert information needs into graphs, graph renderers can present data more usefully, and anomaly detection and aggregators can work more autonomously and avoid common mistakes. The result: less micromanaging of software and configuration, quicker results, more clarity, less frustration, and less room for error.
This talk also covers the tools that turn this concept into production-ready reality. Graph-Explorer is an application that integrates with Graphite: enter an expression that represents an information need, and it generates the corresponding graphs or alerting rules, automatically applying unit conversion, aggregation, processing, and so on. Statsdaemon is an aggregation daemon, like Etsy's Statsd, that expresses the aggregations and statistical operations it performs by updating the metrics' tags, making sure that the metric metadata always corresponds to the data.
Dieter Plaetinck is a systems-gone-backend engineer at Vimeo.
TRANSCRIPT
Credit: user niteroi @ panoramio.com
vimeo.com/43800150
1 Metrics 2.0 concepts
2 Implementation
3 Advanced stuff
“Dieter” ?
“Peter”? “Deter”? →
Terminology sync
(1234567890, 82)
(1234567900, 123)
(1234567910, 109)
(1234567920, 77)
db15.mysql.queries_running
host=db15 mysql.queries_running
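The move from a dotted key to tags can be sketched in a few lines. This is a hypothetical illustration, not a real library: the field names (`host`, `service`, `what`) are assumptions about one sensible split, not a fixed schema.

```python
# Hypothetical sketch: turning a dotted Graphite key into a tag dictionary.
# The tag names used here (host, service, what) are assumptions.
def to_tags(dotted_key):
    """Split 'db15.mysql.queries_running' into self-describing tags."""
    host, service, what = dotted_key.split(".", 2)
    return {"host": host, "service": service, "what": what}

tags = to_tags("db15.mysql.queries_running")
# tags == {"host": "db15", "service": "mysql", "what": "queries_running"}
```

The point is that each value now lives in a named, orthogonal dimension instead of at a position in a string.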
How many pagerequests/s is vimeo.com doing?
● stats.hits.vimeo_com
● stats_counts.hits.vimeo_com
stats.<host>.requesthostport.vimeo_com_443
stats.timers.dfs5.proxyserver.object.GET.200.timing.upper_90
O(X*Y*Z)
X = # apps
Y = # people
Z = # aggregators
How long does it take to retrieve an object from swift?
stats.timers.<host>.proxyserver.<swift_type>.<http_method>.<http_code>.timing.<stat>
stats.timers.<host>.objectserver.<http_method>.timing.<stat>
target=stats.timers.dfs*.object*GET*timing.mean ?
target=groupByNode(stats.timers.dfs*.proxyserver.object.GET.*.timing.mean,2,"avg")
target=stats.timers.dfs*.objectserver.GET.timing.mean
swift_type=object stat=mean timing GET avg by http_code
O((D*V)^2)
D = # dimensions
V = # values per dim
collectd.db.disk.sda1.disk_time.write
What should I name my metric?
101001000
100001000001000000
Metrics 2.0
Old:
● information lacking
● fields unclear & inconsistent
● cumbersome strings / trees
● forbidden characters
New:
● Self-describing
● Standardized
● all dimensions in orthogonal tagspace
● Allow some useful characters
stats.timers.dfs5.proxyserver.object.GET.200.timing.upper_90
{
  “server”: “dfvimeodfsproxy5”,
  “http_method”: “GET”,
  “http_code”: “200”,
  “unit”: “ms”,
  “target_type”: “gauge”,
  “stat”: “upper_90”,
  “swift_type”: “object”,
  “plugin”: “swift_proxy_server”
}
Main advantages:
● Immediate understanding of metric meaning (ideally)
● Minimize time to graphs, dashboards, alerting rules
github.com/vimeo/graph-explorer/wiki
SI + IEC
B Err Warn Conn Job File Req ...
MB/s Err/d Req/h ...
{
“site”: “vimeo.com”,
“port”: 80,
“unit”: “Req/s”,
“direction”: “in”,
“service”: “webapp_php”,
“server”: “webxx”
}
Carbontagger:
... service=foo.instance=host.target_type=gauge.type=calculation.unit=B 123 1234567890
…
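Parsing such a line is straightforward. The sketch below is an assumption about the on-the-wire shape shown above (dot-joined `key=value` pairs, then value and timestamp); it is not carbontagger's actual code, and it ignores edge cases such as dots inside tag values.

```python
# Sketch: parse a metrics-2.0-style line as a tool like carbontagger might.
# Assumed format: dot-joined key=value pairs, a value, and a Unix timestamp.
def parse_line(line):
    name, value, timestamp = line.split()
    tags = dict(pair.split("=", 1) for pair in name.split("."))
    return tags, float(value), int(timestamp)

tags, value, ts = parse_line(
    "service=foo.instance=host.target_type=gauge.type=calculation.unit=B 123 1234567890")
# tags["unit"] == "B", value == 123.0, ts == 1234567890
```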
Statsdaemon:
..unit=B ..unit=B ... → unit=B/s
..unit=ms ..unit=ms .. → unit=ms stat=mean
→ unit=ms stat=upper_90
→ ...
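The key idea above is that when an aggregator derives a rate or a summary statistic, it must also rewrite the tags so the metadata keeps matching the data. A minimal sketch, with illustrative function names (this is not statsdaemon's actual API):

```python
# Sketch: tag updates that mirror the statistical operation performed.
# Names here are illustrative assumptions, not statsdaemon's real API.
def derive_rate(tags):
    out = dict(tags)
    out["unit"] = tags["unit"] + "/s"   # e.g. B -> B/s
    return out

def summarize(tags, stat):
    out = dict(tags)
    out["stat"] = stat                  # e.g. mean, upper_90
    return out

rate_tags = derive_rate({"unit": "B"})          # {"unit": "B/s"}
mean_tags = summarize({"unit": "ms"}, "mean")   # {"unit": "ms", "stat": "mean"}
```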
GraphExplorer queries 101
site:api.vimeo.com unit=Req/s
requesthostport api_vimeo_com
Smoothing
avg over 10M
avg over ...
Aggregation, compare port 80 vs 443
avg by <dimension>
sum by <dimension>
sum by server
Compare port 80 traffic among servers
site:api.vimeo.com unit=Req/s port=80 group by none avg over 10M
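Conceptually, `sum by <dimension>` collapses one tag dimension by summing all series that agree on every other tag. A sketch under assumed data layout (list of `(tags, value)` pairs); this is not Graph-Explorer's implementation:

```python
from collections import defaultdict

# Sketch of "sum by server": collapse one dimension by summing series
# that share all remaining tag values. Data layout is an assumption.
def sum_by(series, dim):
    """series: list of (tags, value); returns {remaining-tags-key: sum}."""
    buckets = defaultdict(float)
    for tags, value in series:
        key = tuple(sorted((k, v) for k, v in tags.items() if k != dim))
        buckets[key] += value
    return dict(buckets)

data = [
    ({"server": "web1", "port": "80"}, 10.0),
    ({"server": "web2", "port": "80"}, 15.0),
    ({"server": "web1", "port": "443"}, 5.0),
]
totals = sum_by(data, "server")
# {(("port", "80"),): 25.0, (("port", "443"),): 5.0}
```

`avg by` works the same way with a mean instead of a sum, and `group by` puts each bucket on its own graph rather than its own line.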
GraphExplorer queries 201
proxyserver swift server:regex upper_90 unit=ms from <datetime> to <datetime> avg over <timespec>
Compare object put/get
Stack .. http_method:(PUT|GET) swift_type=object avg by http_code,server
Comparing servers
http_method:(PUT|GET) avg by http_code,swift_type,http_method group by none
Compare http codes for GET, per swift type
http_method=GET avg by server group by swift_type
transcode unit=Job/s avg over <time> from <datetime> to <datetime>
Note: data is obfuscated
Bucketing
!queue sum by zone:apsoutheast|euwest|useast|uswest|saeast|vimeodf|vimeolv group by state
Note: data is obfuscated
Compare job states per region (zones bucket)
group by zone
Note: data is obfuscated
Unit conversion
unit=Mb/s network dfvimeorpc sum by server
unit=MB
{
server=dfvimeodfs1
plugin=diskspace
mountpoint=_srv_node_dfs5
unit=B
type=used
target_type=gauge
}
server:dfvimeodfs unit=GB type=free srv node
unit=GB/d group by mountpoint
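Because units are explicit tags, a tool can convert them mechanically. A minimal sketch of SI-prefix conversion (single-letter prefixes only, an assumption for brevity; it is not Graph-Explorer's actual converter and ignores IEC prefixes and rate suffixes):

```python
# Sketch: SI-prefix unit conversion, e.g. B -> GB, as a query tool might apply.
SI = {"": 1.0, "k": 1e3, "M": 1e6, "G": 1e9, "T": 1e12}

def split_unit(unit):
    """'GB' -> ('G', 'B'); 'B' -> ('', 'B'). Single-letter prefixes only."""
    if len(unit) > 1 and unit[0] in SI:
        return unit[0], unit[1:]
    return "", unit

def convert(value, from_unit, to_unit):
    fp, fb = split_unit(from_unit)
    tp, tb = split_unit(to_unit)
    assert fb == tb, "incompatible base units"
    return value * SI[fp] / SI[tp]

convert(5e9, "B", "GB")  # 5.0
```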
Dashboard definition
queries = [
'cpu usage sum by core',
'mem unit=B !total group by type:swap',
'stack network unit=b/s',
'unit=B (free|used) group by =mountpoint'
]
stats.dfvimeocliapp2.twitter.error
{
“n1”: “dfvimeocliapp2”,
“n2”: “twitter”,
“n3”: “error”,
“plugin”: “catchall_statsd”,
“source”: “statsd”,
“target_type”: “rate”,
“unit”: “unknown/s”
}
Two hard things in computer science
stats.gauges.files.id_boundary_7day
stats.gauges.files.id_boundary_ceil
unit=File id_boundary_7d
{
“unit”: “File”,
“n1”: “id_boundary_7d”,
}
{
“intrinsic”: {
“site”: “vimeo.com”,
“unit”: “Req/s”
},
“extrinsic”: {
“agent”: “diamond”,
“processed_by”: “statsd1”,
“src”: “index.php:135”,
“replaces”: “vimeo_com_reqps”
}
}
site=vimeo.com unit=Req/s \
processed_by=statsd1 \
src=index.php:135 added_by=dieter \
123 1234567890
Equivalence
servers.host.cpu.total.iowait → “core”: “_sum_”
servers.host.cpu.<corenumber>.iowait
servers.host.loadavg.15
Rollups & aggregation
/etc/carbon/storage-aggregation.conf
[min]
pattern = \.min$
aggregationMethod = min
[max]
pattern = \.max$
aggregationMethod = max
[sum]
pattern = \.count$
aggregationMethod = sum
[default_average]
pattern = .*
aggregationMethod = average
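With self-describing metrics, the rollup method no longer needs regexes on name suffixes: it can be derived from the metric's own tags. The mapping below is an assumption about sensible defaults, not actual Graphite configuration or code:

```python
# Sketch: pick a rollup/aggregation method from metrics 2.0 tags
# instead of regexes on the metric name. Mapping is an assumption.
def rollup_method(tags):
    stat = tags.get("stat", "")
    if stat.startswith(("min", "lower")):
        return "min"
    if stat.startswith(("max", "upper")):
        return "max"
    if tags.get("target_type") == "count":
        return "sum"
    return "average"

rollup_method({"stat": "upper_90"})        # "max"
rollup_method({"target_type": "count"})    # "sum"
```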
2 kinds of graphite users
Self-describing metrics
stat=upper/lower/mean/...
target_type=counter ..
● stats.timers.render_time.histogram.bin_0.01
● stats.timers.render_time.histogram.bin_0.1
● stats.timers.render_time.histogram.bin_1 → unit=Freq_abs bin_upper=1
● stats.timers.render_time.histogram.bin_10
● stats.timers.render_time.histogram.bin_50
● stats.timers.render_time.histogram.bin_inf
● stats.timers.render_time.lower → unit=ms stat=lower
● stats.timers.render_time.mean → unit=ms stat=mean
● stats.timers.render_time.mean_90 → ...
● stats.timers.render_time.median
● stats.timers.render_time.std
● stats.timers.render_time.upper
● stats.timers.render_time.upper_90
Also..
● graphite API functions such as "cumulative", "summarize" and "smartSummarize"
● Graph renderers
From: dygraphs.com
Facet based suggestions
Metric types
● gauge
● count & rate
● counter
● timer
gauge
● Multiple values in same interval
● “sticky”
Count & Rate
Counter
Timer..
http://janabeck.com/blog/2012/10/12/lessonslearnedfrom100/
Timer..
● What should a metric be?
● Stickiness?
● Behavior when no packets received
● Behavior when multiple packets received
My personal takeaways
Conclusion
● Building graphs, setting up alerting is cumbersome
● Esp. changing information needs (troubleshooting, exploring, ..)
● Esp. complicated information needs
→ PAIN
● Structuring metrics
● Self-describing metrics
● Standardized metrics
● Native metrics 2.0
→ BREEZE
Conclusion
● Metrics can be so much more usable and useful. Let's talk about tagging, standardisation, retaining information throughout the pipeline.
● Converting information needs into graph defs, alerting rules
● Graph-Explorer, carbontagger, statsdaemon, …
● graphite-ng (native metrics 2.0)
● Metrics 2.0 in your apps, agents, aggregators?
● Build out structured metrics library
github.com/vimeo
github.com/Dieterbe
twitter.com/Dieter_be
dieter.plaetinck.be