Distributed Monitoring
TRANSCRIPT
Leon Torres, October 15, 2014
Web Startup Challenges
• Low-friction development
• Hodgepodge of technologies
• Hodgepodge of infrastructures
• Legacy support
• Constant migrations and upgrades
• Bottom line:
High rate of change and no time to check!
A Gordian Knot
• How utilized is our Hadoop cluster?
• How utilized is our DC?
• Are all of our services running correctly?
• Is our latency OK at every layer in the stack?
• Someone changed something, were there any negative ripple effects?
• Are we hitting any scaling issues?
A Network Knot
• Our products live on the internet
• Our data centers are global
– Some of them are virtual
• Network effects are a fact of life
– Network partitions
– Latency makes information late
– Noise is natural and frequent
– Data just goes missing
– High availability compounds the problem
– Richard W. Hamming
Solution Design
• Hypothesize the existence of a system state: a time-varying stream of state components
• Build it by measuring our systems in toto
• Stream all measurements to one place
• Gain insight by inspecting this stream computationally and ad-hoc
Separation of Concerns
• State collection
• State computation
• State visualization
Collecting State
• Define a state event ADT capturing:
– Host
– Service
– State
– Timestamp
– Any additional key/value fields
• Find something to collect it
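A minimal sketch of such a state event ADT in Python; the field names beyond the slide's list (and the example values) are illustrative:

```python
from dataclasses import dataclass, field
import time

@dataclass
class StateEvent:
    """One component of global system state: host, service, state,
    timestamp, plus arbitrary extra key/value fields."""
    host: str
    service: str
    state: str                                      # e.g. "ok", "warning", "critical"
    timestamp: float = field(default_factory=time.time)
    attributes: dict = field(default_factory=dict)  # additional key/value fields

# Example event with illustrative values:
event = StateEvent(host="web01", service="disk /", state="ok",
                   attributes={"used_pct": "42"})
```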
Riemann
• Riemann accepts state events as a stream
• Riemann indexes the stream, provides stream processing facilities and some alerting tools
• Also provides downstream pipes:
– Unix domain sockets
– Web sockets
– Graphite stream comes free
– Create your own
Internal State Relays
• Poll third party monitors for state
• Map to Riemann events
• Send to Riemann
• Fill in holes with custom monitors
– Hadoop jobs, load balancer state, etc.
• Foundation in place to know everything about our global DC state
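The poll → map → send loop above might be sketched like this, with `send` standing in for whichever Riemann client is actually in use; the check field names and Nagios-style return codes are assumptions:

```python
def map_to_event(check):
    """Map one third-party check result (Nagios-style return codes)
    onto the Riemann event shape."""
    states = {0: "ok", 1: "warning", 2: "critical"}
    return {
        "host": check["hostname"],
        "service": check["check_name"],
        "state": states.get(check["return_code"], "unknown"),
        "time": check["last_check"],
    }

def relay(checks, send):
    """Poll results in, Riemann events out."""
    for check in checks:
        send(map_to_event(check))
```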
Network Monitors
• Static monitors around the world
– Constantly check HTTP state of services
• Poll third party monitors (Pingdom, etc.)
• Deduce network state from aggregate streams
• Detect outages from user perspective
• Can extend with PhantomJS to get Gomez-like waterfalls and do whatever we want!
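Deducing service state from a single HTTP probe might look like this sketch; the status and latency thresholds are assumptions, not the actual monitors' rules:

```python
def classify_http(status_code, latency_s, latency_slo_s=1.0):
    """Turn one HTTP probe result into a service state."""
    if status_code is None:
        return "critical"        # no response: possible outage or partition
    if status_code >= 500:
        return "critical"
    if status_code >= 400 or latency_s > latency_slo_s:
        return "warning"
    return "ok"
```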
Demo Time
• Ad hoc demo
– Grep the stream
– Quickly analyze state of disk utilization
• Hadoop global state
– It just pipes Nagios data!
• Network monitoring demo
– Let’s combine Pingdom + network monitors
– And iterate! Awesome dashboard
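"Grep the stream" amounts to an ad-hoc filter over events; a hypothetical sketch of it and of the disk-utilization check, with the event shape assumed from the state event ADT above:

```python
def grep_stream(events, needle):
    """Ad-hoc filter: keep events whose service name contains `needle`."""
    return [e for e in events if needle in e["service"]]

def worst_disk(events):
    """Quickly analyze disk utilization: the event with the highest used_pct."""
    disks = grep_stream(events, "disk")
    return max(disks, key=lambda e: float(e["attributes"]["used_pct"]))
```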
Distributed Gotchas
• Riemann can scale, but some nasty surprises
– Events on a TCP connection are processed serially
– If event rate gets too high, stream gets saturated and backs up into OS network buffers, then into Netty’s unbounded buffers. This ultimately starves heap and crashes Riemann.
– Solution is to use large connection pools at the clients that push events
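A sketch of that client-side fix: fan events out round-robin over a pool of connections, so no single serially-processed TCP stream can back up and saturate. `connect` stands in for whichever Riemann client constructor is actually used:

```python
import itertools

class EventClientPool:
    """Spread events across several connections, since events on any
    one TCP connection are processed serially by Riemann."""
    def __init__(self, connect, size=8):
        self._conns = [connect() for _ in range(size)]
        self._rr = itertools.cycle(self._conns)

    def send(self, event):
        # Round-robin: each event goes to the next connection in the pool.
        next(self._rr).send(event)
```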
Distributed Gotchas
• Network outages and partitions are difficult
– Riemann must not go down
– Riemann must deal with split-brain
• Highly available SRE solution planned
– Virtual IP, heartbeat (similar to the LB solution)
• Riemann servers in separate locations
– End up with two masters on a partition => double the alerts, but at least we get something
Are we cutting the knot?