Scaling graphite to handle a zerg rush

December 11, 2016 | Daniel Ben-Zvi, VP of R&D, SaaS Platform
[email protected]

TRANSCRIPT

Page 1: Scaling graphite to handle a zerg rush

Scaling graphite to handle a zerg rush

December 11, 2016 | Daniel Ben-Zvi, VP of R&D, SaaS Platform

[email protected]

Page 2: Scaling graphite to handle a zerg rush

Mission

The problem

Page 3: Scaling graphite to handle a zerg rush

The problem: no metrics across the board

● Hard to debug issues
● No intuitive way to measure efficiency or usage
● Capacity planning?
● Dashboards

Page 4: Scaling graphite to handle a zerg rush

The problem: no metrics across the board

● We knew Graphite
● We wanted statsd for applicative metrics
● And we heard that collectd is nice, so we installed it on 500 physical machines

Page 5: Scaling graphite to handle a zerg rush

Mission

Graphite

Page 6: Scaling graphite to handle a zerg rush

Graphite

[Example graphs: write throughput across our Hadoop fleet; ingress traffic to our load-balancing layer]

"Store numeric time series data" / "Render graphs of this data on demand"

Page 7: Scaling graphite to handle a zerg rush

Graphite Architecture

Image from: github.com/graphite-project/graphite-web/blob/master/README.md
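The producers and relays in this diagram all speak carbon's plaintext protocol: one "metric value timestamp" line per data point, by default on TCP port 2003 (there is also a pickle variant). A minimal Python sketch; the hostname and metric path are placeholders, not values from the deck:

    import socket
    import time

    # Carbon's plaintext protocol: "<metric path> <value> <unix timestamp>\n"
    # The host and metric name below are placeholders for illustration.
    CARBON_HOST, CARBON_PORT = "graphite.example.com", 2003

    def send_metric(path, value, timestamp=None):
        line = f"{path} {value} {int(timestamp or time.time())}\n"
        with socket.create_connection((CARBON_HOST, CARBON_PORT), timeout=5) as sock:
            sock.sendall(line.encode("ascii"))

    send_metric("collectd.web-01.cpu.user", 42.5)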

Page 8: Scaling graphite to handle a zerg rush

Mission

So why did it crash?

Page 9: Scaling graphite to handle a zerg rush

Max IOPS reached

Single Threaded

Graphite

● First setup: 2× 1 TB magnetic drives in RAID 1
● Volume peaked at ~300 IOPS
● carbon-cache maxed out the CPU

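For reference, one quick way to sanity-check a number like the ~300 IOPS above is to sample /proc/diskstats twice. A rough Linux-only sketch; the device name is a placeholder and the count includes both reads and writes:

    import time

    # Rough IOPS estimate by sampling /proc/diskstats (Linux) twice.
    # "sda" is a placeholder device name -- point it at the whisper volume.
    def completed_ops(device="sda"):
        with open("/proc/diskstats") as f:
            for line in f:
                fields = line.split()
                if fields[2] == device:
                    reads, writes = int(fields[3]), int(fields[7])
                    return reads + writes
        raise ValueError(f"device {device!r} not found")

    before = completed_ops()
    time.sleep(10)
    after = completed_ops()
    print(f"~{(after - before) / 10:.0f} IOPS over the last 10 seconds")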
Page 10: Scaling graphite to handle a zerg rush

Graphite

● Why so many IOPS?
● Every metric is a separate file on the filesystem:

/var/data/graphite/collectd/{hostname}/cpu/user.wsp
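A back-of-envelope calculation shows how quickly this layout adds up. The 500 machines come from the deck; the metrics-per-host count and reporting interval below are assumptions for illustration only:

    # Rough write-load estimate for the one-file-per-metric layout.
    # 500 hosts comes from the deck; the other two numbers are assumptions.
    hosts = 500
    metrics_per_host = 200     # assumed: a typical collectd plugin set
    interval_s = 10            # assumed: collectd reporting interval

    wsp_files = hosts * metrics_per_host
    updates_per_s = wsp_files / interval_s
    print(f"{wsp_files} whisper files, ~{updates_per_s:.0f} file updates/s")
    # -> 100000 whisper files, ~10000 file updates/s
    # Each update is a small random write to a different .wsp file, so even
    # with carbon-cache batching points in memory, a RAID 1 pair capped at
    # ~300 IOPS cannot keep up.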

Page 11: Scaling graphite to handle a zerg rush

Mission

Solving the problem

Page 12: Scaling graphite to handle a zerg rush

Mission

Page 13: Scaling graphite to handle a zerg rush

Graphite + Clustering

https://grey-boundary.io/the-architecture-of-clustering-graphite/

Page 14: Scaling graphite to handle a zerg rush

Graphite + Clustering

https://grey-boundary.io/the-architecture-of-clustering-graphite/

This looks nice but do we really need moar machines?

Page 15: Scaling graphite to handle a zerg rush

Graphite + Remember the bottlenecks we had

● carbon-cache reached 100% CPU on a single core (it's probably single-threaded)
● Disks reached maximum IOPS capacity

Page 16: Scaling graphite to handle a zerg rush

Mission

carbon-cache

Page 17: Scaling graphite to handle a zerg rush

Graphite + Carbon-cache

● Persists metrics to disk and serves the hot cache to Graphite
● Python, single-threaded
● So we replaced carbon-cache with go-carbon: a Golang implementation of the Graphite/Carbon server with the classic architecture: Agent -> Cache -> Persister

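A minimal go-carbon.conf sketch of the knobs that matter here. Key names follow the upstream example config as best I recall, and the values are illustrative rather than the deck's actual settings; generate go-carbon's default config and check its README before copying anything:

    # Illustrative go-carbon.conf fragment, not a complete file.
    [common]
    max-cpu = 4                  # go-carbon can actually use multiple cores

    [whisper]
    data-dir = "/var/data/graphite"
    schemas-file = "/etc/go-carbon/storage-schemas.conf"
    workers = 8                  # parallel persister workers
    max-updates-per-second = 0   # 0 = unlimited; cap it to protect the disks
    enabled = true

    [cache]
    max-size = 1000000           # data points buffered in memory
    write-strategy = "max"       # flush the hottest metrics first

    [tcp]
    listen = ":2003"             # same plaintext protocol carbon-cache speaks
    enabled = true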
Page 18: Scaling graphite to handle a zerg rush

Graphite + go-carbon

The result of replacing carbon with go-carbon on a server handling up to 900 thousand metrics per minute:

Reference: https://github.com/lomik/go-carbon

Page 19: Scaling graphite to handle a zerg rush

Graphite + go-carbon

[Diagram annotations: Max IOPS reached; 20% CPU; ×500]

Page 20: Scaling graphite to handle a zerg rush

Mission

Solving the IOPS bottleneck

Page 21: Scaling graphite to handle a zerg rush

Graphite + IOPS

RAID 0? The RAID controller became the bottleneck, and it wasn't enough anyway

SSD? Yes! But one wasn't enough :(

Hadoop inspiration! JBOD (no RAID)

Influx? No!

Page 22: Scaling graphite to handle a zerg rush

Graphite + We wanted this:

Page 23: Scaling graphite to handle a zerg rush

Mission

Load balancer: carbon-relay

Page 24: Scaling graphite to handle a zerg rush

Graphite + carbon-relay

● A "load balancer" between metric producers and go-carbon instances
● The same metric is always routed to the same go-carbon instance via a consistent-hashing algorithm (see the sketch below)
● But… it is a single-threaded Python app, so your mileage may vary
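The point of the consistent hashing is simply that every data point for a given metric path lands on the same go-carbon instance, and therefore in a single whisper file. A generic hash-ring sketch of the idea; this is not carbon-relay's exact ring implementation, and the backend names are placeholders:

    import bisect
    import hashlib

    # Toy consistent-hash ring: the same metric path always maps to the same
    # backend, and adding/removing a backend only moves a fraction of metrics.
    class HashRing:
        def __init__(self, nodes, replicas=100):
            self.ring = sorted(
                (self._hash(f"{node}:{i}"), node)
                for node in nodes
                for i in range(replicas)
            )
            self.points = [h for h, _ in self.ring]

        @staticmethod
        def _hash(key):
            return int(hashlib.md5(key.encode()).hexdigest(), 16)

        def backend_for(self, metric):
            idx = bisect.bisect(self.points, self._hash(metric)) % len(self.ring)
            return self.ring[idx][1]

    ring = HashRing(["go-carbon-a:2003", "go-carbon-b:2003", "go-carbon-c:2003"])
    print(ring.backend_for("collectd.web-01.cpu.user"))  # always the same backend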

Page 25: Scaling graphite to handle a zerg rush

Graphite + IOPS

100% CPU :(

Page 26: Scaling graphite to handle a zerg rush

Graphite + carbon-relay

● We replaced it with carbon-c-relay: a very fast C implementation of carbon-relay (and much more)
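For reference, a minimal carbon-c-relay configuration in this spirit: consistent-hash all incoming metrics across several go-carbon instances, e.g. one per SSD in the JBOD setup. Addresses and ports are placeholders; see the carbon-c-relay README for the full routing syntax:

    # Illustrative carbon-c-relay config: hash metrics across three local
    # go-carbon instances (addresses/ports are placeholders).
    cluster graphite
        carbon_ch replication 1
            127.0.0.1:2103
            127.0.0.1:2203
            127.0.0.1:2303
        ;

    match *
        send to graphite
        stop
        ;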

Page 27: Scaling graphite to handle a zerg rush

Graphite + carbon-c-relay

Page 28: Scaling graphite to handle a zerg rush

Graphite + (Some) performance metrics

[Charts: go-carbon update operations; stack CPU usage]

Page 29: Scaling graphite to handle a zerg rush

Mission

What about statsd?

Page 30: Scaling graphite to handle a zerg rush

Graphite + statsd

Can we scale statsd out?

Page 31: Scaling graphite to handle a zerg rush

Graphite + statsd

Who wins?

If we shard statsd, each instance aggregates the same metric independently and flushes its own result to the same Graphite path, so the last write wins and we end up with wrong data in Graphite.
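A tiny example of the "who wins" problem with counters; the numbers are made up for illustration. Two statsd shards each see half the increments for the same metric name, each flushes its own total to the same Graphite path, and the last flush overwrites the other:

    # Two statsd shards each receive half of 10,000 increments for
    # "api.requests". Each flushes its OWN total under the same Graphite
    # path, so the stored value is whichever shard flushed last.
    total_increments = 10_000
    shard_a = total_increments // 2
    shard_b = total_increments - shard_a

    print("true count  :", total_increments)               # 10000
    print("stored value:", shard_b, "(last flush wins)")   # 5000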

Page 32: Scaling graphite to handle a zerg rush

Mission

Introducing statsite: a C implementation of statsd (and much more)

Page 33: Scaling graphite to handle a zerg rush

Graphite + Statsite

● Wire-compatible with statsd (drop-in replacement)
● Pure C with a tight event loop (very fast)
● Low memory footprint
● Supports quantiles, histograms and much more
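Because statsite is wire-compatible with statsd, existing clients keep working unchanged. A minimal sketch of the statsd line protocol over UDP; the host, port and metric names are placeholders:

    import socket

    # statsd/statsite wire format: "<name>:<value>|<type>"
    # Host, port and metric names are placeholders for illustration.
    STATSD_ADDR = ("statsite.example.com", 8125)
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

    sock.sendto(b"api.requests:1|c", STATSD_ADDR)        # counter
    sock.sendto(b"api.latency_ms:320|ms", STATSD_ADDR)   # timer -> quantiles/histograms
    sock.sendto(b"queue.depth:42|g", STATSD_ADDR)        # gauge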

Page 34: Scaling graphite to handle a zerg rush

Mission

Final setup

Page 35: Scaling graphite to handle a zerg rush

Graphite + Final setup

“Graphite box”

Page 36: Scaling graphite to handle a zerg rush

Mission

Don’t give up on Graphite!

Page 37: Scaling graphite to handle a zerg rush

Mission

Recap

Page 38: Scaling graphite to handle a zerg rush

Graphite + Pros

● A beast of a Graphite stack: peaked at 1M updates per minute, with room for more
● Very efficient: ~10% user-land CPU usage leaves plenty of headroom for IRQs (disk, network)
● We could still scale out the whole stack with another layer of carbon-c-relay, but we never needed to go there

Page 39: Scaling graphite to handle a zerg rush

Graphite + Cons

● SSDs are still expensive and wear out quickly under heavy random-write workloads - less relevant on AWS :-)
● Bugs: custom components are somewhat less field-tested
● Data is not highly available with JBOD
● Doing metrics right is demanding: go SaaS!

Page 40: Scaling graphite to handle a zerg rush

Graphite + Some tuning tips

● UDP creates correlated loss and has shitty backpressure behaviour (actually, NO backpressure). Use TCP when possible

● High frequency UDP packets (statsite) can generate a shit-load of IRQs - balance your interrupts or enforce affinity

● High Carbon PPU (Points per update) signals I/O latency

● Tune go-carbon cache, especially if you alert on metrics
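On the interrupt-balancing tip: a quick way to see whether NIC and disk IRQs are piling up on one core is to sum /proc/interrupts per CPU. A rough Linux-only sketch:

    # Sum hardware interrupt counts per CPU from /proc/interrupts (Linux).
    # A heavily skewed result (e.g. everything on CPU0) is the smell the
    # "balance your interrupts or enforce affinity" tip is about.
    def irq_totals(path="/proc/interrupts"):
        with open(path) as f:
            cpus = f.readline().split()          # header row: CPU0 CPU1 ...
            totals = [0] * len(cpus)
            for line in f:
                fields = line.split()
                if not fields or not fields[0].endswith(":"):
                    continue
                for i, field in enumerate(fields[1:len(cpus) + 1]):
                    if field.isdigit():
                        totals[i] += int(field)
        return dict(zip(cpus, totals))

    for cpu, total in irq_totals().items():
        print(f"{cpu}: {total}")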

Page 41: Scaling graphite to handle a zerg rush

● https://github.com/lomik/go-carbon
● https://github.com/grobian/carbon-c-relay
● https://github.com/statsite/statsite
● https://github.com/similarweb/puppet-go_carbon
● http://www.aosabook.org/en/graphite.html

Links

Page 42: Scaling graphite to handle a zerg rush

We are hiring :-)

Page 43: Scaling graphite to handle a zerg rush

Thank You!