Distributed Monitoring
TRANSCRIPT
Leon Torres, October 15, 2014
Web Startup Challenges
• Low-friction development
• Hodgepodge of technologies
• Hodgepodge of infrastructures
• Legacy support
• Constant migrations and upgrades
• Bottom line:
High rate of change and no time to check!
A Gordian Knot
• How utilized is our Hadoop cluster?
• How utilized is our DC?
• Are all of our services running correctly?
• Is our latency OK at every layer in the stack?
• Someone changed something, were there any negative ripple effects?
• Are we hitting any scaling issues?
A Network Knot
• Our products live on the internet
• Our data centers are global
– Some of them are virtual
• Network effects are a fact of life
– Network partitions
– Latency makes information late
– Noise is natural and frequent
– Data just goes missing
– High availability compounds the problem
– Richard W. Hamming
Solution Design
• Hypothesize the existence of a system state: a time-varying stream of state components
• Build it by measuring our systems in toto
• Stream all measurements to one place
• Gain insight by inspecting this stream computationally and ad-hoc
Separation of Concerns
• State collection
• State computation
• State visualization
Collecting State
• Define a state event ADT capturing:
– Host
– Service
– State
– Timestamp
– Any additional key/value fields
• Find something to collect it
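A minimal sketch of such a state event ADT in Python; the field names beyond the slide's list (and the example values) are illustrative:

```python
from dataclasses import dataclass, field
import time

@dataclass
class StateEvent:
    """One component of global system state: host, service, state,
    timestamp, plus arbitrary extra key/value fields."""
    host: str
    service: str
    state: str                                      # e.g. "ok", "warning", "critical"
    timestamp: float = field(default_factory=time.time)
    attributes: dict = field(default_factory=dict)  # additional key/value fields

# Example event with illustrative values:
event = StateEvent(host="web01", service="disk /", state="ok",
                   attributes={"used_pct": "42"})
```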
Riemann
• Riemann accepts state events as a stream
• Riemann indexes the stream, provides stream processing facilities and some alerting tools
• Also provides downstream pipes:
– Unix domain sockets
– Web sockets
– Graphite stream comes free
– Create your own
Internal State Relays
• Poll third party monitors for state
• Map to Riemann events
• Send to Riemann
• Fill in holes with custom monitors
– Hadoop jobs, load balancer state, etc.
• Foundation in place to know everything about our global DC state
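The poll → map → send loop above might be sketched like this, with `send` standing in for whichever Riemann client is actually in use; the check field names and Nagios-style return codes are assumptions:

```python
def map_to_event(check):
    """Map one third-party check result (Nagios-style return codes)
    onto the Riemann event shape."""
    states = {0: "ok", 1: "warning", 2: "critical"}
    return {
        "host": check["hostname"],
        "service": check["check_name"],
        "state": states.get(check["return_code"], "unknown"),
        "time": check["last_check"],
    }

def relay(checks, send):
    """Poll results in, Riemann events out."""
    for check in checks:
        send(map_to_event(check))
```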
Network Monitors
• Static monitors around the world
– Constantly check HTTP state of services
• Poll third party monitors (Pingdom, etc.)
• Deduce network state from aggregate streams
• Detect outages from user perspective
• Can extend with PhantomJS to get Gomez-like waterfalls and do whatever we want!
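Deducing service state from a single HTTP probe might look like this sketch; the status and latency thresholds are assumptions, not the actual monitors' rules:

```python
def classify_http(status_code, latency_s, latency_slo_s=1.0):
    """Turn one HTTP probe result into a service state."""
    if status_code is None:
        return "critical"        # no response: possible outage or partition
    if status_code >= 500:
        return "critical"
    if status_code >= 400 or latency_s > latency_slo_s:
        return "warning"
    return "ok"
```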
Demo Time
• Ad hoc demo
– Grep the stream
– Quickly analyze state of disk utilization
• Hadoop global state
– It just pipes Nagios data!
• Network monitoring demo
– Let’s combine Pingdom + network monitors
– And iterate! Awesome dashboard
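"Grep the stream" amounts to an ad-hoc filter over events; a hypothetical sketch of it and of the disk-utilization check, with the event shape assumed from the state event ADT above:

```python
def grep_stream(events, needle):
    """Ad-hoc filter: keep events whose service name contains `needle`."""
    return [e for e in events if needle in e["service"]]

def worst_disk(events):
    """Quickly analyze disk utilization: the event with the highest used_pct."""
    disks = grep_stream(events, "disk")
    return max(disks, key=lambda e: float(e["attributes"]["used_pct"]))
```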
Distributed Gotchas
• Riemann can scale, but some nasty surprises
– Events on a TCP connection are processed serially
– If event rate gets too high, stream gets saturated and backs up into OS network buffers, then into Netty’s unbounded buffers. This ultimately starves heap and crashes Riemann.
– Solution is to use large connection pools at the clients that push events
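A sketch of that client-side fix: fan events out round-robin over a pool of connections, so no single serially-processed TCP stream can back up and saturate. `connect` stands in for whichever Riemann client constructor is actually used:

```python
import itertools

class EventClientPool:
    """Spread events across several connections, since events on any
    one TCP connection are processed serially by Riemann."""
    def __init__(self, connect, size=8):
        self._conns = [connect() for _ in range(size)]
        self._rr = itertools.cycle(self._conns)

    def send(self, event):
        # Round-robin: each event goes to the next connection in the pool.
        next(self._rr).send(event)
```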
Distributed Gotchas
• Network outages and partitions are difficult
– Riemann must not go down
– Riemann must deal with split-brain
• Highly available SRE solution planned
– Virtual IP, heartbeat (similar to the LB solution)
• Riemann servers in separate locations
– End up with two masters on a partition => double the alerts, but at least we get something
Are we cutting the knot?