god - process and task monitoring done right
FAILWHALE NEEDS NO INTRODUCTION
Like it or not, the web is 24/7/365
But who wants to be online 24/7/365?
Sometimes, you’ve just gotta take a walk
ZOMG WHAT NOW?
Process monitoring
sudo gem install god
written by: Tom Preston-Werner
Follow along at home:
$ git clone git://github.com/jnewland/god_examples.git
The Basics
$ ruby scripts/crashy.rb
Wed Jul 09 13:53:13 -0400 2008
Wed Jul 09 13:53:14 -0400 2008
Wed Jul 09 13:53:15 -0400 2008
/Users/jnewland/src/god_examples/lib/god_test.rb:28:in `crash': Crash! (RuntimeError)
        from /Users/jnewland/src/god_examples/lib/god_test.rb:20:in `run'
        from /Users/jnewland/src/god_examples/lib/god_test.rb:19:in `loop'
        from /Users/jnewland/src/god_examples/lib/god_test.rb:19:in `run'
        from /Users/jnewland/src/god_examples/lib/god_test.rb:15:in `initialize'
        from scripts/crashy.rb:4:in `new'
        from scripts/crashy.rb:4
# simple.god
# The simplest possible watch
God.watch do |w|
  w.name = 'crashy'
  w.interval = 1.seconds
  w.start = 'ruby scripts/crashy.rb'

  w.start_if do |start|
    start.condition(:process_running) do |c|
      c.running = false
    end
  end
end
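The `:process_running` condition comes down to asking the operating system whether a PID is still alive. The following is a minimal Ruby sketch of that kind of check, assuming a POSIX system. `process_running?` is a hypothetical helper written for illustration, not part of god's API.

```ruby
# Hypothetical helper (not god's API): is a process with this PID alive?
def process_running?(pid)
  Process.kill(0, pid) # signal 0 sends nothing; it only checks the PID
  true
rescue Errno::ESRCH    # no such process
  false
rescue Errno::EPERM    # the process exists but belongs to another user
  true
end

process_running?(Process.pid) # => true
```

A watch with `c.running = false` fires its start command whenever a check like this comes back false.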
$ god -h
...
Options:
  -c, --config-file CONFIG     Configuration file
  -p, --port PORT              Communications port (default 17165)
  -b, --auto-bind              Auto-bind to an unused port number
  -P, --pid FILE               Where to write the PID file
  -l, --log FILE               Where to write the log file
  -D, --no-daemonize           Don't daemonize
  -v, --version                Print the version number and exit
$ god -c simple.god -D
[... 20:19:33 #10897] INFO: Using pid file directory: /Users/jnewland/.god/pids
[... 20:19:34 #10897] INFO: Started on drbunix:///tmp/god.17165.sock
[... 20:19:34 #10897] INFO: crashy move 'unmonitored' to 'up'
[... 20:19:34 #10897] INFO: crashy moved 'unmonitored' to 'up'
[... 20:19:34 #10897] INFO: crashy [trigger] process is not running (ProcessRunning)
[... 20:19:34 #10897] INFO: crashy move 'up' to 'start'
[... 20:19:34 #10897] INFO: crashy start: ruby scripts/crashy.rb
[... 20:19:34 #10897] INFO: crashy moved 'up' to 'up'
[... 20:19:34 #10897] INFO: crashy [ok] process is running (ProcessRunning)
[... 20:19:35 #10897] INFO: crashy [ok] process is running (ProcessRunning)
[... 20:19:36 #10897] INFO: crashy [ok] process is running (ProcessRunning)
[... 20:19:37 #10897] INFO: crashy [ok] process is running (ProcessRunning)
[... 20:19:38 #10897] INFO: crashy [ok] process is running (ProcessRunning)
[... 20:19:39 #10897] INFO: crashy [ok] process is running (ProcessRunning)
[... 20:19:40 #10897] INFO: crashy [trigger] process is not running (ProcessRunning)
[... 20:19:40 #10897] INFO: crashy move 'up' to 'start'
[... 20:19:40 #10897] INFO: crashy start: ruby scripts/crashy.rb
[... 20:19:40 #10897] INFO: crashy moved 'up' to 'up'
[... 20:19:40 #10897] INFO: crashy [ok] process is running (ProcessRunning)
[... 20:19:41 #10897] INFO: crashy [ok] process is running (ProcessRunning)
$ god -c simple.god
$ ps ax | grep ruby
12512 ?? Ss 0:00.03 ruby /Users/jnewland/src/god_examples/scripts/crashy.rb
12484 s001 S 0:00.36 /usr/bin/ruby /usr/bin/god -c simple.god
$ god -h
...
Commands:
  start <task or group name>       start task or group
  restart <task or group name>     restart task or group
  stop <task or group name>        stop task or group
  monitor <task or group name>     monitor task or group
  unmonitor <task or group name>   unmonitor task or group
  remove <task or group name>      remove task or group from god
  load <file>                      load a config into a running god
  log <task name>                  show realtime log for given task
  status                           show status of each task
  quit                             stop god
  terminate                        stop god and all tasks
  check                            run self diagnostic
$ god status
crashy: up
$ god restart crashy
Sending 'restart' command

The following watches were affected:
  crashy
$ god stop crashy
Sending 'stop' command

The following watches were affected:
  crashy
$ god status
crashy: unmonitored
$ god start crashy
Sending 'start' command

The following watches were affected:
  crashy
$ god status
crashy: up
Controlling Leaky Processes
# leaky.god
God.watch do |w|
  w.name = "leaky"
  w.interval = 5.seconds
  w.start = 'ruby scripts/leaky.rb'

  w.start_if do |start|
    start.condition(:process_running) do |c|
      c.running = false
    end
  end

  w.restart_if do |restart|
    restart.condition(:memory_usage) do |c|
      c.above = 2.megabytes
    end
  end
end
CPU Usage
  w.restart_if do |restart|
    restart.condition(:cpu_usage) do |c|
      c.above = 50.percent
      c.times = [3, 5]
    end
  end
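`c.times = [3, 5]` reads as "3 out of the last 5 checks": one spike alone does not cause a restart. The following is a rough Ruby sketch of that sliding-window idea. `SlidingThreshold` and `trigger?` are hypothetical names for illustration; god keeps its own internal bookkeeping, which may differ in detail (for example in how history is reset after a trigger).

```ruby
# Hypothetical illustration of "N of the last M samples above a limit".
class SlidingThreshold
  def initialize(above:, times:)
    @above = above
    @needed, @window = times # e.g. [3, 5] => 3 hits needed within 5 samples
    @samples = []
  end

  # Record one sample; return true when enough recent samples exceed the limit.
  def trigger?(value)
    @samples << value
    @samples.shift while @samples.size > @window
    @samples.count { |sample| sample > @above } >= @needed
  end
end

check = SlidingThreshold.new(above: 50, times: [3, 5])
results = [60, 10, 70, 20, 80].map { |cpu| check.trigger?(cpu) }
# results => [false, false, false, false, true]
```

Only the fifth reading trips the condition, because it is the third sample above 50 within the five-sample window.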
HTTP Status Codes
  w.restart_if do |restart|
    restart.condition(:http_response_code) do |c|
      c.host = 'localhost'
      c.port = '80'
      c.path = '/heartbeat'
      c.code_is_not = %w(200 304)
    end
  end
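To make the `code_is_not` idea concrete, here is a small sketch of an equivalent check written directly against Ruby's `Net::HTTP`. `unhealthy?` and its keyword arguments are hypothetical names, and treating a refused connection or a timeout as unhealthy is an assumption of this sketch rather than a statement about god's internals.

```ruby
require 'net/http'

# Hypothetical helper: true when the response code is not in the accepted list.
def unhealthy?(host:, port:, path:, code_is_not:, timeout: 5)
  response = Net::HTTP.start(host, port,
                             open_timeout: timeout,
                             read_timeout: timeout) do |http|
    http.get(path)
  end
  # Net::HTTP reports the status as a string such as "200".
  !code_is_not.include?(response.code)
rescue SystemCallError, SocketError, Timeout::Error, EOFError
  # Assumption: an unreachable or hung server counts as unhealthy.
  true
end
```

With the values from the slide, the call would look like `unhealthy?(host: 'localhost', port: 80, path: '/heartbeat', code_is_not: %w(200 304))`.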
Notifications
# email_contacts.god
God::Contacts::Email.message_settings = {
  :from => '[email protected]'
}

God::Contacts::Email.server_settings = {
  :address => "smtp.jnewland.com",
  :port => 25,
  :domain => "jnewland.com",
  :authentication => :plain,
  :user_name => "god",
  :password => ""
}

God.contact(:email) do |c|
  c.name = 'jesse'
  c.email = '[email protected]'
end
# http://github.com/mojombo/god/tree/master/lib/god/contacts/jabber.rb
require 'jabber'

God::Contacts::Jabber.settings = {
  :jabber_id => '[email protected]',
  :password => ' '
}

God.contact(:jabber) do |c|
  c.name = 'jesse'
  c.jabber_id = '[email protected]'
end
  w.restart_if do |restart|
    restart.condition(:cpu_usage) do |c|
      c.above = 50.percent
      c.times = [3, 5]
      c.notify = "jesse"
    end
  end
Monitoring Mongrels
Putting it all together
• Process Running
• Memory Usage
• CPU Usage
• HTTP Response Code
• Notifications
• Capistrano?
• Web Interface?
# rails/config/god/app.god
RAILS_ROOT = ENV['RAILS_ROOT'] ||= "/var/www/apps/test/current"
RUBY = `which ruby`.chomp
MONGREL_RAILS = `which mongrel_rails`.chomp
RAILS_ENV = ENV['RAILS_ENV'] ||= 'production'
MONGRELS = 2
MONGREL_START_PORT = 3000
USER = GROUP = 'deploy'

0.upto(MONGRELS-1) do |n|
  port = MONGREL_START_PORT+n
  God.watch do |w|
    w.group = 'mongrels'
    w.name = "mongrel_#{port}"
    w.uid = USER
    w.gid = GROUP
    w.interval = 30.seconds
    w.start = "#{RUBY} #{MONGREL_RAILS} start --environment #{RAILS_ENV} --chdir #{RAILS_ROOT} --port #{port}"
    w.start_grace = 90.seconds
    w.restart_grace = 90.seconds
    w.log = File.join(RAILS_ROOT, "log/mongrel_#{port}.log")

    # process running

    # memory usage

    # cpu usage

    # http response code
  end
end
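To see what the loop above produces, the same port arithmetic can be run on its own, without god. This is only an illustrative sketch: `mongrel_watches` is a hypothetical helper, and it collects plain hashes instead of registering watches.

```ruby
# Hypothetical helper: list the watch name, port, and log path per mongrel.
def mongrel_watches(count, start_port)
  0.upto(count - 1).map do |n|
    port = start_port + n
    { name: "mongrel_#{port}", port: port, log: "log/mongrel_#{port}.log" }
  end
end

mongrel_watches(2, 3000).each do |watch|
  puts "#{watch[:name]} -> port #{watch[:port]}, #{watch[:log]}"
end
# prints:
#   mongrel_3000 -> port 3000, log/mongrel_3000.log
#   mongrel_3001 -> port 3001, log/mongrel_3001.log
```

Because every watch shares the group `mongrels`, commands such as `god restart mongrels` act on all of them at once.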
class PulseController < ApplicationController
  session :off

  def pulse
    if (ActiveRecord::Base.connection.execute("select 1").num_rows rescue 0) == 1
      render :text => "OK #{Time.now.utc.to_s(:db)}"
    else
      render :text => 'ERROR', :status => :internal_server_error
    end
  end
end
Pulse Controller
Capistrano
# rails/config/deploy.rb
role :app, "test.jnewland.com"

require 'san_juan'
san_juan.role :app, %w(mongrels)

# overwrite the default start, stop, and restart tasks to use god
namespace :deploy do

  desc "Use god to restart the app"
  task :restart do
    god.all.reload
    god.app.mongrels.restart
  end

  desc "Use god to start the app"
  task :start do
    god.all.start
  end

  desc "Use god to stop the app"
  task :stop do
    god.all.terminate
  end

end
$ cap -T
...
cap god:all:quit               # Quit god, but not the processes it's monitoring
cap god:all:reload             # Reloading God Config
cap god:all:start              # Start god
cap god:all:start_interactive  # Start god interactively
cap god:all:status             # Describe the status of the running tasks on ...
cap god:all:terminate          # Terminate god and all monitored processes
cap god:app:mongrels:log       # Log mongrels
cap god:app:mongrels:remove    # Remove mongrels
cap god:app:mongrels:restart   # Restart mongrels
cap god:app:mongrels:start     # Start mongrels
cap god:app:mongrels:stop      # Stop mongrels
cap god:app:mongrels:unmonitor # Unmonitor mongrels
cap god:app:quit               # Quit god, but not the processes it's monitoring
cap god:app:reload             # Reload the god config file
cap god:app:start              # Start god
cap god:app:start_interactive  # Start god interactively
cap god:app:status             # Describe the status of the running tasks
cap god:app:terminate          # Terminate god and all monitored processes
...
http://github.com/jnewland/san_juan
ZOMG WHAT NOW?
#rails/config/god/app.god
...
require 'god_web'
GodWeb.watch(:port => 3003)
...
http://github.com/jnewland/god_web
Advanced Features
# jabber_bot.god
  w.restart_if do |restart|
    restart.condition(:lambda) do |c|
      c.interval = 15.seconds
      c.lambda = lambda do
        require 'xmpp4r-simple'
        im = Jabber::Simple.new(
          '[email protected]',
          PASSWORDS['[email protected]']
        )
        im.deliver('[email protected]', 'ping')
        sleep(5)
        return true unless im.received_messages?
        chat = im.received_messages.find { |msg| msg.type == :chat }
        return true unless chat.body =~ /pong/
      end
    end
  end
Lambda Conditions
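The lambda above relies on two Ruby details: `return` inside a `lambda` exits only the lambda, and a truthy result means the condition is met, so the restart fires. The following is a minimal sketch of the same decision logic with the XMPP round trip taken out. `pong_missing?` and `replies` are hypothetical stand-ins for the received chat message bodies.

```ruby
# Hypothetical stand-in for the jabber check: `replies` is a list of
# message bodies that came back after sending 'ping'.
def pong_missing?(replies)
  check = lambda do
    return true if replies.empty?                             # nothing came back
    return true unless replies.any? { |body| body =~ /pong/ } # wrong answer
    false                                                     # bot answered
  end
  check.call # `return` above leaves the lambda, not this method
end

pong_missing?([])        # => true
pong_missing?(['pong'])  # => false
pong_missing?(['hello']) # => true
```

In the slide's version the healthy path simply falls off the end of the lambda, which yields `nil`. That is falsy, so no restart happens.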
# custom_behavior.god
module God
  module Behaviors
    class Speak < Behavior

      def before_start
        `say "Starting now"`
        'announced start'
      end

      def before_stop
        `say "Stopping now"`
        'announced stop'
      end

    end
  end
end
God.watch do |w|
  ...
  w.behavior(:speak)
  ...
end
Behaviors
# mongrel_cluster.god
require 'lib/god_mongrel_cluster'

Dir.glob('/etc/mongrel_cluster/*.conf').each do |mongrel_cluster|
  cluster = GodMongrelCluster.new(mongrel_cluster)
  cluster.watch
end
mongrel_cluster
Questions?
http://www.flickr.com/photos/stuckincustoms/522313332/
http://www.flickr.com/photos/91499534@N00/2335651912/
http://www.flickr.com/photos/code_martial/1411893703/
http://www.flickr.com/photos/extranoise/163847669/
http://www.flickr.com/photos/vanz/2480741207/
http://www.flickr.com/photos/smartjunco/281071006/
http://www.flickr.com/photos/davesag/8312984/
http://www.flickr.com/photos/gaetanlee/298178764/
http://www.flickr.com/photos/vrogy/511644410/
http://www.flickr.com/photos/jeffsmallwood/299208539/
http://www.flickr.com/photos/cjdaniel/2240123159/
http://www.flickr.com/photos/bobbygreg/139080175/
http://www.flickr.com/photos/lordelo/12958772/
Hooray Flickr! (And Creative Commons)
http://creativecommons.org/licenses/by-sa/2.0/deed.en