TRANSCRIPT
Scaling graphite to handle a zerg rush
December 11, 2016 | Daniel Ben-Zvi, VP of R&D, SaaS Platform
The problem
The problem: No metrics across the board
●Hard to debug issues
●No intuitive way to measure efficiency, usage
●Capacity planning?
●Dashboards
The problem: No metrics across the board
●We knew graphite
●We wanted statsd for applicative metrics (see the sketch below)
●And we heard that collectd is nice, so we installed it on 500 physical machines
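Applicative metrics reach statsd over its plain UDP wire protocol, one "name:value|type" datagram per event. A minimal client sketch, assuming a hypothetical statsd.internal:8125 endpoint (host, port and metric names here are placeholders):

```python
import socket
import time

# Placeholder endpoint - point this at your statsd/statsite instance.
STATSD_ADDR = ("statsd.internal", 8125)
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

def incr(name, value=1):
    """Counter, e.g. 'api.requests.count:1|c'."""
    sock.sendto(f"{name}:{value}|c".encode(), STATSD_ADDR)

def timing(name, ms):
    """Timer in milliseconds, e.g. 'api.requests.latency:12.5|ms'."""
    sock.sendto(f"{name}:{ms}|ms".encode(), STATSD_ADDR)

# Example: instrument a request handler.
start = time.time()
# ... handle the request ...
timing("api.requests.latency", (time.time() - start) * 1000)
incr("api.requests.count")
```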
Graphite
Graphite
Write throughput across our Hadoop fleet
Ingress traffic to our load balancing layer
"Store numeric time series data""Render graphs of this data on demand"
Graphite: Architecture
Image from: github.com/graphite-project/graphite-web/blob/master/README.md
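Those two promises map to two interfaces: carbon accepts datapoints over the plaintext protocol (one "path value timestamp" line per datapoint, TCP port 2003 by default), and graphite-web renders them on demand via the /render HTTP endpoint. A minimal sketch, where graphite.internal and the metric path are placeholders:

```python
import socket
import time

# Store: one datapoint over carbon's plaintext protocol (default TCP port 2003).
# Host and metric path are placeholders.
line = f"collectd.web-01.cpu.user 42.0 {int(time.time())}\n"
with socket.create_connection(("graphite.internal", 2003)) as conn:
    conn.sendall(line.encode())

# Render: graphite-web draws the stored series on demand.
print("http://graphite.internal/render"
      "?target=collectd.web-01.cpu.user&from=-1h&format=png")
```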
So why did it crash?
●Max IOPS reached
●Single threaded
Graphite
●First setup: 2x 1TB magnetic drives in RAID 1
●Volume peaked at ~300 IOPS
●Carbon-cache maxed the CPU
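If you want to see that IOPS number on your own boxes, one rough way is to sample the kernel's per-disk write counters and take the delta per second, for example with psutil; the device name below is an assumption (yours may be md0, nvme0n1, etc.):

```python
import time
import psutil

DEVICE = "sda"  # assumption: use the device backing your whisper volume

def write_iops(interval=10):
    """Approximate write IOPS from the kernel's per-disk counters."""
    before = psutil.disk_io_counters(perdisk=True)[DEVICE].write_count
    time.sleep(interval)
    after = psutil.disk_io_counters(perdisk=True)[DEVICE].write_count
    return (after - before) / interval

print(f"{write_iops():.0f} write IOPS on {DEVICE}")
```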
Graphite
●Why so many IOPS?
●Every metric is a separate file on the FS:
/var/data/graphite/collectd/{hostname}/cpu/user.wsp
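Since every series is its own whisper (.wsp) file, each incoming datapoint turns into a small random write against one of thousands of files. A quick sketch for counting how many write targets a data directory actually holds:

```python
import os

WHISPER_ROOT = "/var/data/graphite"  # as in the path above

count = 0
for _dirpath, _dirnames, filenames in os.walk(WHISPER_ROOT):
    count += sum(1 for name in filenames if name.endswith(".wsp"))

print(f"{count} whisper files (one random-write target per metric) under {WHISPER_ROOT}")
```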
Solving the problem
Graphite + Clustering
https://grey-boundary.io/the-architecture-of-clustering-graphite/
This looks nice but do we really need moar machines?
Graphite + Remember the bottlenecks we had
●Carbon-cache reached 100% CPU on a single core (it's single-threaded)
●Disks reached maximum IOPS capacity
carbon-cache
Graphite + carbon-cache
●Persists metrics to disk and serves hot-cache to graphite
●Python, single threaded
●So we replaced carbon-cache with go-carbon: a Golang implementation of the Graphite/Carbon server with the classic architecture: Agent -> Cache -> Persister
Graphite + go-carbon
The result of replacing carbon with go-carbon on a server handling up to 900 thousand metrics per minute:
Reference: https://github.com/lomik/go-carbon
Graphite + go-carbon
●Max IOPS reached
●20% CPU
●x500
Solving the IOPS bottleneck
Graphite + IOPS
●RAID 0? The RAID controller became the bottleneck, and it wasn't enough anyway
●SSD? Yes! But one wasn't enough :(
●Hadoop inspiration! JBOD (no RAID)
●Influx? No!
Graphite + We wanted this:
Load balancer: carbon-relay
Graphite + carbon-relay
●"Load balancer" between metric producers and go-carbon instances
●The same metric is routed to the same go-carbon instance via a consistent hashing algorithm (see the sketch after this list)
●But… it is a single-threaded Python app, so your mileage may vary
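To illustrate the routing idea (this is not carbon-relay's exact ring implementation, just a toy sketch of consistent hashing): each backend gets many points on a hash ring, and a metric goes to the first backend at or after the hash of its name, so the same name always lands on the same go-carbon instance.

```python
import bisect
import hashlib

class HashRing:
    """Toy consistent-hash ring; carbon-relay's real ring differs in detail."""

    def __init__(self, backends, replicas=100):
        # Each backend is hashed onto the ring many times for smoother balance.
        self.ring = sorted(
            (self._hash(f"{backend}:{i}"), backend)
            for backend in backends
            for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def route(self, metric):
        """First backend at or after the metric's hash, wrapping around the ring."""
        idx = bisect.bisect(self.keys, self._hash(metric)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["go-carbon-a:2003", "go-carbon-b:2003", "go-carbon-c:2003"])
print(ring.route("collectd.web-01.cpu.user"))  # same backend every time for this name
```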
Graphite + IOPS
100% CPU :(
Graphite + carbon-relay
●We replaced it with carbon-c-relay: a very fast C implementation of carbon-relay (and much more)
Graphite + carbon-c-relay
Graphite + (Some) Performance metrics
[Charts: go-carbon update operations; stack CPU usage]
What about statsd?
Graphite + statsd
Can we scale statsd out?
Graphite + statsd
Who wins?
If we shard statsd, we end up with wrong data in Graphite: each shard flushes its own partial aggregate for the same metric, and the last write wins.
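A toy illustration of the "who wins" problem (not our production setup): two statsd shards behind a dumb load balancer each flush a partial total for the same counter to the same Graphite series, and whisper keeps only one value per timestamp. The same logic breaks timers and percentiles, which cannot be recombined across shards at all.

```python
# 1000 increments of the same counter arriving during one flush interval.
events = ["api.requests.count"] * 1000

# One statsd instance sees everything and flushes the true total.
true_total = len(events)                      # 1000

# Two shards behind a dumb load balancer each see roughly half of the events
# and each flush their own partial total to the SAME Graphite series.
shard_a = len(events[::2])                    # ~500
shard_b = len(events[1::2])                   # ~500

# Whisper stores one value per (series, timestamp), so the last flush wins.
stored = shard_b
print(f"true count {true_total}, shards flushed {shard_a} and {shard_b}, Graphite keeps {stored}")
```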
Introducing statsite: C implementation of statsd (and much more)
Graphite + statsite
●Wire compatible with statsd (drop-in replacement)
●Pure C with a tight event loop (very fast)
●Low memory footprint
●Supports quantiles, histograms and much more.
Final setup
Graphite + Final setup
“Graphite box”
Don’t give up on Graphite!
Recap
Graphite + Pros
●Beast of a graphite stack: peaked at 1M updates per minute, with room for more
●Very efficient: ~10% user-land CPU usage, leaving more room for IRQs (disk, network)
●We can still scale out the whole stack with another layer of carbon-c-relay, but we never needed to go there.
Graphite + Cons
●SSDs are still expensive and wear out quickly under heavy random-write scenarios - less relevant on AWS :-)
●Bugs - custom components are somewhat less field-tested.
●Data is not highly available with JBOD
●Doing metrics right is demanding - go SaaS!
Graphite + Some tuning tips
● UDP creates correlated loss and has shitty backpressure behaviour (actually, NO backpressure). Use TCP when possible
● High frequency UDP packets (statsite) can generate a shit-load of IRQs - balance your interrupts or enforce affinity
● A high Carbon PPU (points per update) signals I/O latency: the cache is flushing more points per write because the disks can't keep up
● Tune go-carbon cache, especially if you alert on metrics
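One way to alert on the PPU and cache tips above is to pull go-carbon's self-metrics back out through graphite-web's /render JSON API. A sketch, with the assumption that self-metrics are published under carbon.agents.* (the exact metric path and threshold depend on your config):

```python
import json
from urllib.request import urlopen

# Assumptions: graphite-web reachable at graphite.internal, go-carbon self-metrics
# published under carbon.agents.* - the exact metric path depends on your config.
TARGET = "carbon.agents.graphite-01.persister.pointsPerUpdate"
URL = f"http://graphite.internal/render?target={TARGET}&from=-15min&format=json"

series = json.load(urlopen(URL))
values = [v for v, _ts in series[0]["datapoints"] if v is not None]

# Arbitrary threshold: a sustained rise in points-per-update means the cache is
# batching more points per write because the disks are falling behind.
if values and max(values) > 10:
    print(f"high PPU ({max(values):.1f}) - check disk I/O latency")
```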
Links
● https://github.com/lomik/go-carbon
● https://github.com/grobian/carbon-c-relay
● https://github.com/statsite/statsite
● https://github.com/similarweb/puppet-go_carbon
● http://www.aosabook.org/en/graphite.html
We are hiring :-)
Thank You!