TRANSCRIPT
Scaling graphite to handle a zerg rush
December 11, 2016 | Daniel Ben-Zvi, VP of R&D, SaaS Platform
The problem
The problem: No metrics across the board
●Hard to debug issues
●No intuitive way to measure efficiency, usage
●Capacity planning?
●Dashboards
The problem: No metrics across the board
●We knew graphite
●We wanted statsd for applicative metrics (see the sketch below)
●And we heard that collectd is nice, so we installed it on 500 physical machines
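Applicative metrics reach statsd over its plain UDP wire protocol, one "name:value|type" datagram per event. A minimal client sketch, assuming a hypothetical statsd.internal:8125 endpoint (host, port and metric names here are placeholders):

```python
import socket
import time

# Placeholder endpoint - point this at your statsd/statsite instance.
STATSD_ADDR = ("statsd.internal", 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr(name, value=1):
    """Counter, e.g. 'api.requests.count:1|c'."""
    sock.sendto(f"{name}:{value}|c".encode(), STATSD_ADDR)

def timing(name, ms):
    """Timer in milliseconds, e.g. 'api.requests.latency:12.5|ms'."""
    sock.sendto(f"{name}:{ms}|ms".encode(), STATSD_ADDR)

# Example: instrument a request handler.
start = time.time()
# ... handle the request ...
timing("api.requests.latency", (time.time() - start) * 1000)
incr("api.requests.count")
```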
Graphite
Graphite
Write throughput across our Hadoop fleet
Ingress traffic to our load balancing layer
"Store numeric time series data""Render graphs of this data on demand"
Graphite: Architecture
Image from: github.com/graphite-project/graphite-web/blob/master/README.md
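Those two promises map to two interfaces: carbon accepts datapoints over the plaintext protocol (one "path value timestamp" line per datapoint, TCP port 2003 by default), and graphite-web renders them on demand via the /render HTTP endpoint. A minimal sketch, where graphite.internal and the metric path are placeholders:

```python
import socket
import time

# Store: one datapoint over carbon's plaintext protocol (default TCP port 2003).
# Host and metric path are placeholders.
line = f"collectd.web-01.cpu.user 42.0 {int(time.time())}\n"
with socket.create_connection(("graphite.internal", 2003)) as conn:
    conn.sendall(line.encode())

# Render: graphite-web draws the stored series on demand.
print("http://graphite.internal/render"
      "?target=collectd.web-01.cpu.user&from=-1h&format=png")
```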
So why did it crash?
●Max IOPS reached
●Single threaded
Graphite
●First setup: 2x 1TB magnetic drives in RAID 1
●Volume peaked at ~300 IOPS
●Carbon-cache maxed the CPU
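If you want to see that IOPS number on your own boxes, one rough way is to sample the kernel's per-disk write counters and take the delta per second, for example with psutil; the device name below is an assumption (yours may be md0, nvme0n1, etc.):

```python
import time
import psutil

DEVICE = "sda"  # assumption: use the device backing your whisper volume

def write_iops(interval=10):
    """Approximate write IOPS from the kernel's per-disk counters."""
    before = psutil.disk_io_counters(perdisk=True)[DEVICE].write_count
    time.sleep(interval)
    after = psutil.disk_io_counters(perdisk=True)[DEVICE].write_count
    return (after - before) / interval

print(f"{write_iops():.0f} write IOPS on {DEVICE}")
```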
Graphite
●Why so many IOPS?
●Every metric is a separate file on the FS:
/var/data/graphite/collectd/{hostname}/cpu/user.wsp
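Since every series is its own whisper (.wsp) file, each incoming datapoint turns into a small random write against one of thousands of files. A quick sketch for counting how many write targets a data directory actually holds:

```python
import os

WHISPER_ROOT = "/var/data/graphite"  # as in the path above

count = 0
for _dirpath, _dirnames, filenames in os.walk(WHISPER_ROOT):
    count += sum(1 for name in filenames if name.endswith(".wsp"))

print(f"{count} whisper files (one random-write target per metric) under {WHISPER_ROOT}")
```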
Solving the problem
Graphite + Clustering
https://grey-boundary.io/the-architecture-of-clustering-graphite/
This looks nice but do we really need moar machines?
Graphite + Remember the bottlenecks we had
●Carbon-cache reached 100% CPU on a single core (it's single-threaded)
●Disks reached maximum IOPS capacity
carbon-cache
Graphite + carbon-cache
●Persists metrics to disk and serves hot-cache to graphite
●Python, single threaded
●So we replaced carbon-cache with go-carbon: a Golang implementation of the Graphite/Carbon server with the classic architecture: Agent -> Cache -> Persister
Graphite + go-carbon
The result of replacing carbon with go-carbon on a server handling up to 900 thousand metrics per minute:
Reference: https://github.com/lomik/go-carbon
Graphite + go-carbon
●Max IOPS reached
●20% CPU
●x500
Solving the IOPS bottleneck
Graphite + IOPS
●RAID 0? The RAID controller became the bottleneck, and it wasn't enough anyway
●SSD? Yes! But one wasn't enough :(
●Hadoop inspiration! JBOD (no RAID)
●Influx? No!
Graphite + We wanted this:
Load balancer: carbon-relay
Graphite + carbon-relay
●"Load balancer" between metric producers and go-carbon instances
●The same metric is routed to the same go-carbon instance via a consistent hashing algorithm (see the sketch after this list)
●But… it is a single-threaded Python app, so your mileage may vary
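To illustrate the routing idea (this is not carbon-relay's exact ring implementation, just a toy sketch of consistent hashing): each backend gets many points on a hash ring, and a metric goes to the first backend at or after the hash of its name, so the same name always lands on the same go-carbon instance.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring; carbon-relay's real ring differs in detail."""

    def __init__(self, backends, replicas=100):
        # Each backend is hashed onto the ring many times for smoother balance.
        self.ring = sorted(
            (self._hash(f"{backend}:{i}"), backend)
            for backend in backends
            for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def route(self, metric):
        """First backend at or after the metric's hash, wrapping around the ring."""
        idx = bisect.bisect(self.keys, self._hash(metric)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["go-carbon-a:2003", "go-carbon-b:2003", "go-carbon-c:2003"])
print(ring.route("collectd.web-01.cpu.user"))  # same backend every time for this name
```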
Graphite + IOPS
100% CPU :(
Graphite + carbon-relay
●We replaced it with carbon-c-relay: a very fast C implementation of carbon-relay (and much more)
Graphite + carbon-c-relay
Graphite + (Some) Performance metrics
[Charts: go-carbon update operations; stack CPU usage]
What about statsd?
Graphite + statsd
Can we scale statsd out?
Graphite + statsd
Who wins?
If we shard statsd, we end up with wrong data in Graphite: each shard flushes its own partial aggregate for the same metric, and the last write wins.
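A toy illustration of the "who wins" problem (not our production setup): two statsd shards behind a dumb load balancer each flush a partial total for the same counter to the same Graphite series, and whisper keeps only one value per timestamp. The same logic breaks timers and percentiles, which cannot be recombined across shards at all.

```python
# 1000 increments of the same counter arriving during one flush interval.
events = ["api.requests.count"] * 1000

# One statsd instance sees everything and flushes the true total.
true_total = len(events)                      # 1000

# Two shards behind a dumb load balancer each see roughly half of the events
# and each flush their own partial total to the SAME Graphite series.
shard_a = len(events[::2])                    # ~500
shard_b = len(events[1::2])                   # ~500

# Whisper stores one value per (series, timestamp), so the last flush wins.
stored = shard_b
print(f"true count {true_total}, shards flushed {shard_a} and {shard_b}, Graphite keeps {stored}")
```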
Introducing statsite: C implementation of statsd (and much more)
Graphite + statsite
●Wire compatible with statsd (drop-in replacement)
●Pure C with a tight event loop (very fast)
●Low memory footprint
●Supports quantiles, histograms and much more.
Final setup
Graphite + Final setup
“Graphite box”
Don’t give up on Graphite!
Recap
Graphite + Pros
●Beast of a graphite stack: peaked at 1M updates per minute, with room for more
●Very efficient: ~10% user-land CPU usage, leaving more room for IRQs (disk, network)
●We can still scale out the whole stack with another layer of carbon-c-relay, but we never needed to go there.
Graphite + Cons
●SSDs are still expensive and wear out quickly under heavy random-write scenarios - less relevant on AWS :-)
●Bugs - custom components are somewhat less field-tested.
●Data is not highly available with JBOD
●Doing metrics right is demanding - go SaaS!
Graphite + Some tuning tips
● UDP creates correlated loss and has shitty backpressure behaviour (actually, NO backpressure). Use TCP when possible
● High frequency UDP packets (statsite) can generate a shit-load of IRQs - balance your interrupts or enforce affinity
● A high Carbon PPU (points per update) signals I/O latency: the cache is flushing more points per write because the disks can't keep up
● Tune go-carbon cache, especially if you alert on metrics
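One way to alert on the PPU and cache tips above is to pull go-carbon's self-metrics back out through graphite-web's /render JSON API. A sketch, with the assumption that self-metrics are published under carbon.agents.* (the exact metric path and threshold depend on your config):

```python
import json
from urllib.request import urlopen

# Assumptions: graphite-web reachable at graphite.internal, go-carbon self-metrics
# published under carbon.agents.* - the exact metric path depends on your config.
TARGET = "carbon.agents.graphite-01.persister.pointsPerUpdate"
URL = f"http://graphite.internal/render?target={TARGET}&from=-15min&format=json"

series = json.load(urlopen(URL))
values = [v for v, _ts in series[0]["datapoints"] if v is not None]

# Arbitrary threshold: a sustained rise in points-per-update means the cache is
# batching more points per write because the disks are falling behind.
if values and max(values) > 10:
    print(f"high PPU ({max(values):.1f}) - check disk I/O latency")
```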
Links
● https://github.com/lomik/go-carbon
● https://github.com/grobian/carbon-c-relay
● https://github.com/statsite/statsite
● https://github.com/similarweb/puppet-go_carbon
● http://www.aosabook.org/en/graphite.html
We are hiring :-)
Thank You!