monitoring with ganglia

Monitoring with Ganglia Vladimir Vuksan @vvuksan http://blog.vuksan.com/

Upload: fastly

Post on 15-Jan-2015


DESCRIPTION

June 24, 2014. At Velocity 2014, Fastly engineer Vladimir Vuksan gave an intro to Ganglia concepts (grid, clusters, hosts) and walked through the installation of a sample monitoring grid. He also went through the following commonly used visualization tools and how they may aid in detecting issues, identifying causes, and taking corrective action:

- Cluster/Grid views
- Aggregate graphs
- Compare hosts
- Custom graph functionality
- Views
- Interactive graphs
- Trending
- Nagios/alerting system integration
- How to add metrics to Ganglia
- Different export formats such as JSON, CSV, and XML

TRANSCRIPT

Page 1: Monitoring with Ganglia

Monitoring with Ganglia

Vladimir Vuksan @vvuksan

http://blog.vuksan.com/

Page 2: Monitoring with Ganglia

Who am I

● Have done systems administration for over 20 years

● Ganglia contributor

● Co-authored O'Reilly book about Ganglia

● Work at Fastly

● @vvuksan on Twitter

Page 3: Monitoring with Ganglia

Ganglia book

Book signing Wednesday 6/25 at 10:45 in the O'Reilly Author booth

Page 4: Monitoring with Ganglia

What is Ganglia

● Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization

● Started in 2002

http://ganglia.info/


Page 5: Monitoring with Ganglia

How I got involved

● Got introduced to Ganglia in 2005

● Loved it

● In 2010 started working on a rewrite of the Ganglia web UI

● In 2011 became one of the Ganglia core developers

Page 6: Monitoring with Ganglia

Tutorial outline

● Why Ganglia

● Ganglia basics

● Ganglia setup demo

● Ganglia web UI demo

● Choose your own adventure topics

Page 7: Monitoring with Ganglia

Why do we monitor

● Problem/issue detection (MTTD/MTTR)

● Trending – where are we going

● Learn how our infrastructure/system really behaves

The time zone difference between NZ and SYD is 2 hours => people are predictable

Page 8: Monitoring with Ganglia

Why Ganglia ?

● Relatively easy to set up and track lots of metrics

● Doesn't impose a heavy operational burden, i.e. most installs don't require multiple machines, proxies, HBase etc.

● Doesn't require lots of work to provide me with tons of usable graphs

● Lots of features geared toward power users e.g. aggregate graphs, compare hosts, views

Page 9: Monitoring with Ganglia

Ganglia Architecture

● 2 daemons: gmond & gmetad

● gmond sends and/or receives metrics and keeps them in memory

● 1 gmetad per grid. It polls 1 gmond per cluster for data.

● A node belongs to a cluster. A cluster belongs to a grid.

● The web UI is a separate item: use it or lose it

Page 10: Monitoring with Ganglia

Transport

● Gmonds talk to each other over UDP

● Gmonds expose metrics over TCP as XML

● Gmetad exposes metrics over TCP as XML

Page 11: Monitoring with Ganglia

Multicast vs. unicast transport

● Multicast is the default

● Works great in environments that are on a single network segment e.g. compute grids, corporate networks

● Zero config

● Doesn't work in the cloud as multicast is filtered

● Allows for some interesting implementations since all nodes know about metrics from all other nodes

● Where multicast doesn't work, use unicast

Page 12: Monitoring with Ganglia

Write scaling using RRDcached

● If you have lots of metrics, your I/O subsystem will likely become the bottleneck. Use SSDs and RRDcached (which consolidates writes)

● RRDcached daemon options on Ubuntu/Debian go in /etc/default/rrdcached

OPTS=" -t 60 -w 180 -z 180 -F -s ganglia -m 664 \
 -l 127.0.0.1:9998 -s ganglia -m 777 -P FLUSH,STATS,HELP \
 -l unix:/tmp/rrdcached.limited.sock -b /var/lib/ganglia/rrds -B \
 -p /var/lib/ganglia/rrdcached.pid"

● Tell gmetad where to look

● Prior to 3.7.0, use an environment variable

– export RRDCACHED_ADDRESS=/tmp/rrdcached.sock

● In 3.7.0+, use a gmetad.conf setting

– rrdcached_address 127.0.0.1:9998

● Tell the web UI where to look

$conf['rrdcached_socket'] = "unix:/tmp/rrdcached.limited.sock";

Page 13: Monitoring with Ganglia

Network buffers scaling

● You will need to increase your UDP buffer size. Default is 128k

● Bump it up in sysctl

sysctl -w net.core.rmem_max=15000000

● Bump up conntrack for good measure

sysctl -w net.nf_conntrack_max=512000

● In gmond.conf under udp_recv_channel add

buffer = 10000000

Page 14: Monitoring with Ganglia

Getting data in

● Via gmond modules, written in C or Python.

– Varnish metrics, Apache metrics

● Via gmetric or libraries that implement the gmetric protocol.

● Via other daemons designed to feed metrics to ganglia (e.g. statsd)
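The first option, a gmond module written in Python, could look roughly like the sketch below. It assumes the usual gmond Python module layout (a `metric_init` function returning a list of metric descriptors, a per-metric callback, and `metric_cleanup`); the metric name `example_load_one` and the group `example` are made up for illustration, and the module would still need a matching entry in the gmond Python module configuration, which is not shown here.

```python
import os


def read_load_one(name):
    """Callback that gmond would invoke to collect the value for `name`."""
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    return float(os.getloadavg()[0])


def metric_init(params):
    """Called once at module load; returns a list of metric descriptors."""
    descriptor = {
        "name": "example_load_one",  # hypothetical metric name
        "call_back": read_load_one,
        "time_max": 90,
        "value_type": "float",
        "units": "load",
        "slope": "both",
        "format": "%f",
        "description": "One minute load average (example)",
        "groups": "example",  # hypothetical metric group
    }
    return [descriptor]


def metric_cleanup():
    """Called once when the module is unloaded; nothing to clean up here."""
    pass


if __name__ == "__main__":
    # Run standalone to check the callbacks without gmond.
    for d in metric_init({}):
        print(d["name"], d["call_back"](d["name"]))
```

Running the file directly prints the metric name and its current value, which is a quick way to debug a module before handing it to gmond.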

Page 15: Monitoring with Ganglia

Zero metric configuration

● Just start sending new metrics.

● gmetad will create a new RRD file for any new metric it sees.

● The web UI will draw a basic graph for every metric.

● You can create nice colored graphs later if you want them.

Page 16: Monitoring with Ganglia

Gmond shenanigans

● One aggregating gmond required for each cluster

● Deficiency in the protocol :-(

Page 17: Monitoring with Ganglia

Demo setup

[Diagram: three gmond senders (SFO, AMS, NYC); three gmond aggregators, each labeled SFO on the slide, on ports 50001, 50002, and 50003; a gmetad poller; and the web UI.]

Page 18: Monitoring with Ganglia

Install

● On aggregator

apt-get -y install ganglia-monitor ganglia-monitor-python gmetad rrdtool ganglia-webfrontend

● On nodes

apt-get -y install ganglia-monitor ganglia-monitor-python

Page 19: Monitoring with Ganglia

Gmond configuration

● Separate aggregator and sender nodes

● We'll be using unicast

Page 20: Monitoring with Ganglia

Sender config

● Send metrics (global section)

mute = no

deaf = yes

● Remove any udp_recv_channels and tcp_accept_channels

● Ganglia sends metadata packets separately from metric packets. If the aggregator doesn't have the metadata, metrics will not show up. This becomes a problem if the aggregator gets restarted. It is not a problem in multicast setups, where gmonds can send each other messages requesting metadata, but it needs to be set in unicast. Set the following in the global section

send_metadata_interval = 60

Page 21: Monitoring with Ganglia

Aggregator config

● Receive metrics only (global section)

deaf = no

mute = yes

● Remove any udp_send_channels defined
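To make the sender/aggregator split concrete, here is a small sketch that renders the relevant gmond.conf fragments for each role. It assumes the global section is the `globals { ... }` block of gmond.conf; the aggregator address `1.2.3.4` and port `8649` are placeholders, and a real gmond.conf contains further sections (cluster name, modules, collection groups) that are left out.

```python
def render_gmond_conf(role, aggregator_host="1.2.3.4", port=8649):
    """Return the gmond.conf fragments that differ between the two roles.

    role is "sender" or "aggregator". The host and port defaults are
    placeholders, not values taken from a real deployment.
    """
    if role == "sender":
        lines = [
            "globals {",
            "  mute = no",  # do send metrics
            "  deaf = yes",  # do not listen for metrics from others
            "  send_metadata_interval = 60",  # resend metadata (needed in unicast)
            "}",
            "udp_send_channel {",
            f"  host = {aggregator_host}",
            f"  port = {port}",
            "}",
        ]
    elif role == "aggregator":
        lines = [
            "globals {",
            "  mute = yes",  # receive only
            "  deaf = no",
            "}",
            "udp_recv_channel {",
            f"  port = {port}",
            "}",
            "tcp_accept_channel {",
            f"  port = {port}",
            "}",
        ]
    else:
        raise ValueError(f"unknown role: {role!r}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_gmond_conf("sender"))
    print(render_gmond_conf("aggregator"))
```

Generating both fragments from one function keeps the port in sync between senders and the aggregator, which is the detail most easily mistyped by hand.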

Page 22: Monitoring with Ganglia

Node name determination

● Out of the box, the receiving/aggregator gmond will use reverse DNS resolution to determine the hostname/node name for received metric packets

● To set the desired host name, use the following in the global section

override_hostname = "my_hostname"

Page 23: Monitoring with Ganglia

Zero configuration

● Just start sending new metrics.

● gmetad will create a new RRD file for any new metric it sees.

● The web UI will draw a basic graph for every metric.

● You can create nice colored graphs later if you want them.

Page 24: Monitoring with Ganglia

High availability setup

gmond.conf (unicast)

udp_send_channel { host = 1.2.3.4 port = 8649 }

gmond.conf (unicast)

udp_send_channel { host = 9.8.7.6 port = 8649 }

US aggregating gmond.conf

udp_recv_channel { port = 8649 }
tcp_accept_channel { port = 8649 }

EU aggregating gmond.conf

udp_recv_channel { port = 8649 }
tcp_accept_channel { port = 8649 }

US gmetad.conf

data_source "cluster" 1.2.3.4

EU gmetad.conf

data_source "cluster" 9.8.7.6

[Diagram: two Ganglia Web UI instances behind DNS, one active and one failover.]

Page 25: Monitoring with Ganglia

Ganglia Demo

Page 26: Monitoring with Ganglia

Web UI tutorial

Page 27: Monitoring with Ganglia

Search

● Search as you type – shows matching hosts then metrics

Page 28: Monitoring with Ganglia

Views

● Arbitrary collection of graphs

– Individual metrics

– Composite graphs

– Aggregate graphs

● How to add

– Add through the web UI

– Configure using JSON configuration files

Page 29: Monitoring with Ganglia

Views JSON config example

$ cat /var/lib/ganglia-web/conf/view_cpu_util.json

{

  "view_name": "CPU utilization",

  "default_size": "medium",

  "items": [

    {

      "hostname": "aggregator",

      "metric": "cpu_idle",

      "vertical_label": "%",

      "title": "CPU Idle"

    }

  ],

  "view_type": "standard",

  "parent": null

}

Page 30: Monitoring with Ganglia

Aggregate graphs

● Easy composite graph creation

● Requires

– Host regular expression

– Metric regular expression

Page 31: Monitoring with Ganglia

Common regular expressions

● Show both bytes_in and bytes_out

– bytes_(in|out)

● Show any metric that starts with bytes

– ^bytes_

● Show only bytes_out and not varnish_bytes_out or bytes_out_compressed

– ^bytes_out$

● Only hosts cache-5, cache-7 and cache-9

– ^cache-(5|7|9)

● All hosts from cache-5 to cache-9

– ^cache-[5-9]

● All hosts except ones starting with cache-t

– ^cache-[^t]
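These patterns can be tried out before pasting them into the UI. The sketch below uses Python's `re.search` as a stand-in, assuming the web UI applies the patterns as unanchored matches (which is what the explicit `^` and `$` in the examples suggest); the host and metric names are made up. Note that `^cache-(5|7|9)` without a trailing `$` also matches a host such as cache-50.

```python
import re

# Hypothetical metric and host names used only for this demonstration.
METRICS = ["bytes_in", "bytes_out", "varnish_bytes_out", "bytes_out_compressed"]
HOSTS = ["cache-5", "cache-7", "cache-9", "cache-50", "cache-t1", "web-1"]


def matching(pattern, names):
    """Return the names in which the pattern is found anywhere (unanchored)."""
    return [name for name in names if re.search(pattern, name)]


if __name__ == "__main__":
    for pattern in (r"bytes_(in|out)", r"^bytes_", r"^bytes_out$"):
        print(pattern, "->", matching(pattern, METRICS))
    for pattern in (r"^cache-(5|7|9)", r"^cache-(5|7|9)$", r"^cache-[5-9]", r"^cache-[^t]"):
        print(pattern, "->", matching(pattern, HOSTS))
```

If only the three listed hosts are wanted, anchoring the end of the pattern (`^cache-(5|7|9)$`) avoids picking up hosts whose number merely starts with 5, 7 or 9.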

Page 32: Monitoring with Ganglia

Compare hosts

● Compare a set of hosts defined by a regular expression across all common metrics

● Aggregate graphs on steroids

● Will generate hundreds/thousands of aggregate graphs you can use for analysis

Page 33: Monitoring with Ganglia

Events

● View events/Add Events

Page 34: Monitoring with Ganglia

Add events API driven

● Use curl from an init script or deploy script

curl -v "http://ganglia.server/api/events.php?action=add&start_time=now&summary=Restart+of+daemon&host_regex=$HOSTNAME"
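When the event is added from a deploy tool rather than a shell script, the same URL can be assembled programmatically. This is a minimal sketch that only builds the URL, using the query parameters shown in the curl example (`action`, `start_time`, `summary`, `host_regex`); the function name and the host `web-01` are illustrative, and actually fetching the URL is left to whatever HTTP client the tool already uses.

```python
from urllib.parse import urlencode


def build_event_url(base_url, summary, host_regex, start_time="now"):
    """Build the events API URL for adding an event.

    urlencode() takes care of escaping, e.g. spaces in the summary become "+".
    """
    query = urlencode(
        {
            "action": "add",
            "start_time": start_time,
            "summary": summary,
            "host_regex": host_regex,
        }
    )
    return f"{base_url}/api/events.php?{query}"


if __name__ == "__main__":
    print(build_event_url("http://ganglia.server", "Restart of daemon", "web-01"))
```

Letting `urlencode` do the escaping matters once summaries contain characters such as `&` or `#`, which would otherwise truncate the request.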

Page 35: Monitoring with Ganglia

Automatic rotation

● Aimed at ops teams that need to continuously rotate through metrics to help spot early signs of trouble.

● Metrics will be rotated until the browser window is closed.

● If you have multiple monitors you can invoke different views to be rotated on different monitors.

Page 36: Monitoring with Ganglia

Live Dashboard

● Adaptation of Tasseo for Ganglia https://github.com/obfuscurity/tasseo

Page 37: Monitoring with Ganglia

Mobile view

● Mobile optimized view for Ganglia.

● Intended for any mobile browser supported by the jQueryMobile toolkit. This covers most WebKit implementations, i.e. Android, iPhone iOS, HP webOS and Blackberry OS 6+.

● Provides a better experience viewing Ganglia on your mobile phone by eliminating panning and zooming.

Page 38: Monitoring with Ganglia

UI components you can interact with in host view

Page 39: Monitoring with Ganglia

Add to view

Page 40: Monitoring with Ganglia

Inspect

● Interactive graph you can hover over and zoom

Page 41: Monitoring with Ganglia

Trend

Page 42: Monitoring with Ganglia

Timeshift

Page 43: Monitoring with Ganglia

CSV and JSON export

● Export data from the graph you are currently viewing for further processing e.g. in a spreadsheet

● Can be done to any image URL by appending either &csv=1 or &json=1

Page 44: Monitoring with Ganglia

XML export from Gmetad

● curl http://localhost:8652/MYCLUSTER/pico.domain.com/load_one

<HOST NAME="pico.domain.com" IP="10.24.5.123" REPORTED="1403577908" TN="2" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1403557741" TAGS="">

<METRIC NAME="load_one" VAL="2.80" TYPE="float" UNITS=" " TN="62" TMAX="70" DMAX="0" SLOPE="both" SOURCE="gmond">

<EXTRA_DATA>

<EXTRA_ELEMENT NAME="GROUP" VAL="load"/>

<EXTRA_ELEMENT NAME="DESC" VAL="One minute load average"/>

<EXTRA_ELEMENT NAME="TITLE" VAL="One Minute Load Average"/>

</EXTRA_DATA>

</METRIC>

</HOST>
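XML like the above is easy to consume from a script. The sketch below parses the sample HOST element with Python's standard library and pulls out the metric value and its extra elements. It assumes the response is either a bare HOST element, as shown on the slide, or a document that contains one somewhere beneath the root; fetching the XML from gmetad is not shown.

```python
import xml.etree.ElementTree as ET

# The sample response from the slide, embedded so the example is self-contained.
SAMPLE = """<HOST NAME="pico.domain.com" IP="10.24.5.123" REPORTED="1403577908" TN="2" TMAX="20" DMAX="0" LOCATION="unspecified" GMOND_STARTED="1403557741" TAGS="">
<METRIC NAME="load_one" VAL="2.80" TYPE="float" UNITS=" " TN="62" TMAX="70" DMAX="0" SLOPE="both" SOURCE="gmond">
<EXTRA_DATA>
<EXTRA_ELEMENT NAME="GROUP" VAL="load"/>
<EXTRA_ELEMENT NAME="DESC" VAL="One minute load average"/>
<EXTRA_ELEMENT NAME="TITLE" VAL="One Minute Load Average"/>
</EXTRA_DATA>
</METRIC>
</HOST>"""


def parse_host(xml_text):
    """Return (host name, {metric name: details}) for the first HOST element."""
    root = ET.fromstring(xml_text)
    host = root if root.tag == "HOST" else root.find(".//HOST")
    if host is None:
        raise ValueError("no HOST element found")
    metrics = {}
    for metric in host.iter("METRIC"):
        extra = {e.get("NAME"): e.get("VAL") for e in metric.iter("EXTRA_ELEMENT")}
        metrics[metric.get("NAME")] = {
            "value": metric.get("VAL"),
            "type": metric.get("TYPE"),
            "extra": extra,
        }
    return host.get("NAME"), metrics


if __name__ == "__main__":
    name, metrics = parse_host(SAMPLE)
    print(name, metrics["load_one"]["value"], metrics["load_one"]["extra"]["TITLE"])
```

Values arrive as strings; the `TYPE` attribute says how to convert them (here `float`).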

Page 45: Monitoring with Ganglia

Choose your own adventure

● Nagios integrations/Alerting

● Ad-hoc views

● Statsd

● Config options to tune

● Export to other systems

● Ask anything

Page 46: Monitoring with Ganglia

Nagios integration / Alerting

Page 47: Monitoring with Ganglia

Nagios integration/Alerting

● Implements Nagios checks using Ganglia

● You already have nearly all the data you need for alerting, i.e. current load, disk utilization etc.

● If it's something you are going to alert on, you might want to trend it

● Provides for much richer alerts

– Use custom criteria other than over/under threshold e.g. percentage of combined values

– Check multiple values – make sure no one is currently working on a machine (indicated by presence of /etc/disabled file)

Page 48: Monitoring with Ganglia

Nagios integration cont'd

● Check a single metric

– Alert if one minute load average is > 5

check_command  check_ganglia_metric!load_one!more!5

– Alert if number of local IPs is not exactly 5

check_command  check_ganglia_metric!local_ips!notequal!5

● Check multiple metrics on a single host – check all disks

check_command  check_ganglia_multiple_metrics!disk_free_rootfs,less,10:disk_free_tmp,less,20
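To show how a check string such as `disk_free_rootfs,less,10:disk_free_tmp,less,20` reads, here is an illustrative evaluator. It is not the actual Nagios plugin: it only assumes that checks are separated by `:` and that each check is `metric,operator,threshold`, as in the example above. The operator names `more`, `less` and `notequal` come from the slides; `equal` is an added assumption, and the metric values are made up.

```python
import operator

# Operator names seen on the slides, plus "equal" as an assumption.
OPERATORS = {
    "more": operator.gt,
    "less": operator.lt,
    "equal": operator.eq,
    "notequal": operator.ne,
}


def parse_checks(spec):
    """Split 'metric,op,threshold:metric,op,threshold' into tuples."""
    checks = []
    for item in spec.split(":"):
        metric, op_name, threshold = item.split(",")
        checks.append((metric, op_name, float(threshold)))
    return checks


def failing_checks(spec, values):
    """Return a description of every check whose alert condition is met."""
    failures = []
    for metric, op_name, threshold in parse_checks(spec):
        if OPERATORS[op_name](values[metric], threshold):
            failures.append(f"{metric} = {values[metric]} ({op_name} {threshold:g})")
    return failures


if __name__ == "__main__":
    spec = "disk_free_rootfs,less,10:disk_free_tmp,less,20"
    current = {"disk_free_rootfs": 4.0, "disk_free_tmp": 55.0}  # made-up values
    print(failing_checks(spec, current))
```

An empty result means every metric is within bounds; any entry in the list corresponds to a condition that would raise an alert.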

Page 49: Monitoring with Ganglia

Nagios integration cont'd

● Check multiple metrics on multiple hosts specified by a regex

● Useful in situations where failures occur rarely

● For example, send the number of failed disks in a disk array to Ganglia. Alert on any failure

check_command  check_host_regex_ignore_unknowns!'.*'!failed_disks,more,0

● Result

# Services OK = 236, CRIT/UNK = 2 :
CRITICAL compute-4566.domain.com failed_disks = 1 disks,
CRITICAL git-0341.domain.com failed_disks = 1 disks

Page 50: Monitoring with Ganglia

Check value same everywhere

● Sometimes you need to ensure that

– App revision is consistent across all servers (polling may be tricky due to firewalls, network partitions etc.)

– You have deployed all config files

check_command  check_value_same_everywhere!^cache-|^varnish-!varnish_vcl_loaded

● Result

VCLs loaded are not the same on all hosts CRITICAL CRIT varnish_vcl_loaded differs values
  53 ( cache-1, cache-3, cache-4 )
  52 ( cache-2 )
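The idea behind this check can be sketched in a few lines: select the hosts that match the regex, group them by the metric's value, and complain if more than one group exists. This is an illustration of the logic only, not the plugin's code; the function names are hypothetical and the values mirror the result shown above.

```python
import re
from collections import defaultdict


def group_by_value(host_values, host_regex):
    """Group hosts matching host_regex by the value they report."""
    groups = defaultdict(list)
    for host, value in sorted(host_values.items()):
        if re.search(host_regex, host):
            groups[value].append(host)
    return dict(groups)


def same_everywhere(host_values, host_regex):
    """True if all matching hosts report the same value (or none match)."""
    return len(group_by_value(host_values, host_regex)) <= 1


if __name__ == "__main__":
    # Values of varnish_vcl_loaded per host, as in the result on the slide.
    reported = {"cache-1": "53", "cache-2": "52", "cache-3": "53", "cache-4": "53", "web-1": "40"}
    for value, hosts in group_by_value(reported, r"^cache-|^varnish-").items():
        print(value, "(", ", ".join(hosts), ")")
```

Because gmetad already holds the latest value for every host, the comparison needs no connection to the hosts themselves, which sidesteps the firewall and partition issues mentioned above.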

Page 51: Monitoring with Ganglia

Files present

● The alerting checks will not alert on any machine that has the following file present

– /etc/ganglia_silence

● You will need to expose this as a metric

Page 52: Monitoring with Ganglia

Ad-Hoc Views

Page 53: Monitoring with Ganglia

Ad-Hoc views

● Define arbitrary views on the fly

● Enable them in conf.php

$conf['ad-hoc-views'] = true;

● Supply the complete view JSON config as a GET or POST variable e.g.

&ad-hoc-view={"view_name": "CPU utilization","default_size": …
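Since the view definition travels inside a URL, it has to be serialized to JSON and URL-encoded first. A minimal sketch, reusing the CPU utilization view from the JSON config example; the page URL `http://ganglia.example.com/ganglia2/tasseo.php` is a hypothetical placeholder and the helper names are made up.

```python
import json
from urllib.parse import quote, unquote

# Same structure as the view JSON config example earlier in the deck.
VIEW = {
    "view_name": "CPU utilization",
    "default_size": "medium",
    "items": [
        {
            "hostname": "aggregator",
            "metric": "cpu_idle",
            "vertical_label": "%",
            "title": "CPU Idle",
        }
    ],
    "view_type": "standard",
    "parent": None,
}


def ad_hoc_view_url(page_url, view):
    """Append the URL-encoded view JSON as the ad-hoc-view GET variable."""
    separator = "&" if "?" in page_url else "?"
    encoded = quote(json.dumps(view), safe="")
    return f"{page_url}{separator}ad-hoc-view={encoded}"


def view_from_url(url):
    """Decode the view back out of a URL (handy for checking the encoding)."""
    encoded = url.split("ad-hoc-view=", 1)[1]
    return json.loads(unquote(encoded))


if __name__ == "__main__":
    url = ad_hoc_view_url("http://ganglia.example.com/ganglia2/tasseo.php", VIEW)
    print(url)
    print(view_from_url(url) == VIEW)
```

For larger views the encoded URL gets long quickly, which is presumably why the slide also mentions POST as an option.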

Page 54: Monitoring with Ganglia

Use ad-hoc views with Tasseo

● You can also use them with Tasseo e.g.

● URL suffix

/ganglia2/tasseo.php?ad-hoc-view=

Page 55: Monitoring with Ganglia

Misc hacks

Page 56: Monitoring with Ganglia

Misc hacks

● Notify a chat channel of an average number of HTTP errors

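# Timestamp (seconds since the epoch) for 15 minutes ago
MIN15AGO=$(date --date="15 minutes ago" "+%s")

# Fetch the last 15 minutes of nginx_500 as CSV and average the second column
ERROR_RATE=$(curl --silent "http://ganglia.domain.com/ganglia/graph.php?c=Web&h=webserver&v=&m=nginx_500&cs=$MIN15AGO&csv=1" | \
awk -F, '{sum+=$2} END { print "Average = ",sum/NR}')

# Send to HipChat (the URL is quoted so the shell does not treat & as a background operator)
curl -d "room_id=ourRoom&from=Ganglia&message=Error Rate = $ERROR_RATE&color=red&notify=1" "https://api.hipchat.com/v1/rooms/message?auth_token=AUTH_TOKEN_HERE&format=json"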

http://blog.vuksan.com/2012/04/06/
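The averaging step of this hack can also be done in Python, which makes it easier to skip rows that should not count. The sketch below assumes the CSV export has a timestamp in the first column and the metric value in the second, possibly with a header row and NaN entries for missing samples; the sample data is made up. Unlike the awk one-liner, which divides by the total number of rows, it averages only the rows that hold a numeric value.

```python
import csv
import io
import math


def average_second_column(csv_text):
    """Average the numeric values in the second CSV column.

    Header rows, short rows and NaN values are skipped. Returns None
    when no usable value is found.
    """
    values = []
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 2:
            continue
        try:
            value = float(row[1])
        except ValueError:
            continue  # header or other non-numeric cell
        if math.isnan(value):
            continue  # missing sample
        values.append(value)
    if not values:
        return None
    return sum(values) / len(values)


if __name__ == "__main__":
    # Made-up export in the assumed "timestamp,value" layout.
    sample = "Timestamp,nginx_500\n1403577900,2\n1403577915,4\n1403577930,NaN\n"
    print("Average =", average_second_column(sample))
```

The CSV text itself would come from the same graph URL with `&csv=1` appended, fetched with any HTTP client.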

Page 57: Monitoring with Ganglia

Reporting

● You crazy ?

● Use Ganglia as a common one way bus

● Ganglia supports string metrics. Use them :-)

● Send out key application version numbers, config hashes etc.

Page 58: Monitoring with Ganglia

Exports

Page 59: Monitoring with Ganglia

Graphite Export

● Make sure you use UDP transport to send out metrics to graphite. TCP doesn't perform as well. Enable following settings in gmetad.conf

carbon_server "my.graphite.box"

carbon_port 2003

carbon_protocol udp

● If you don't care for the Ganglia Web UI, you can disable writing of RRDs

write_rrds off
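For testing the Graphite side independently of gmetad, a metric can be sent by hand. The sketch below uses Carbon's plaintext line format (`path value timestamp`, one metric per line) over UDP; the metric path `web.webserver.nginx_500` is made up, the host and port defaults mirror the settings above, and the exact paths gmetad generates are not shown here.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one metric in Carbon's plaintext format: 'path value timestamp'."""
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_udp(line, host="my.graphite.box", port=2003):
    """Send a single plaintext metric line to Carbon over UDP (fire and forget)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    try:
        sock.sendto(line.encode("ascii"), (host, port))
    finally:
        sock.close()


if __name__ == "__main__":
    # Only prints the line; call send_udp(line) against a real Carbon host.
    line = carbon_line("web.webserver.nginx_500", 3)
    print(line, end="")
```

UDP gives no delivery confirmation, so checking that the value appears in Graphite is the only way to confirm the listener is enabled on the Carbon side.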

Page 60: Monitoring with Ganglia

Memcache Export

● Add following in gmetad.conf

memcached_parameters "--SERVER=127.0.0.1 --POOL-MIN=10 --POOL-MAX=70"

Page 61: Monitoring with Ganglia

Riemann Export

● Riemann is a powerful event stream processor

● To enable

riemann_server "my.riemann.box"

riemann_port 5555

Page 62: Monitoring with Ganglia

Statsd implementations

● Pystatsd

– Built-in support for Ganglia

– https://github.com/sivy/pystatsd/

● Etsy statsd

– You need a pluggable statsd backend

– https://github.com/jbuchbinder/statsd-ganglia-backend

Page 63: Monitoring with Ganglia

Tuning

Page 64: Monitoring with Ganglia

Config options to tune

● Add override config options in conf.php (overrides anything in conf_default.php)

● Remove stats from graph legend

$conf['graphreport_stats'] = false;

● Change default metric that shows up. Default load_one

$conf['default_metric'] = "cpu_report";

● Disable authentication – enables view and event creation (if you are behind firewall/basic auth)

$conf['auth_system'] = 'disabled';

● Don't show all host metrics by default.

$conf['metric_groups_initially_collapsed'] = true;

Page 65: Monitoring with Ganglia

Config options to tune

● Change default time ranges

$conf['time_ranges'] = array(

'hour'=>3600,

'2hr'=>7200,

'4hr'=>14400,

'day'=>86400,

'week'=>604800,

'month'=>2419200);

Page 66: Monitoring with Ganglia

Links

● Ganglia GitHub repos

– http://github.com/ganglia/