Server Administration in Python with Fabric, Cuisine and Watchdog

Fabric, Cuisine & Watchdog
Sébastien Pierre, ffunction inc.
@ Montréal Python, February 2011
www.ffctn.com


How to use Python for server administration
Thanks to Fabric, Cuisine* and Watchdog*
* custom tools

The way we use servers has changed

The era of dedicated servers
WEB SERVER, DATABASE SERVER, EMAIL SERVER
Hosted in your server room or in colocation
Sysadmins typically SSH in and configure the servers live
The servers are conservatively managed; updates are risky

The era of slices/VPS
[Diagram: numbered slices hosted at Linode.com and Amazon EC2, plus DEDICATED SERVER 1 and DEDICATED SERVER 2 at IWeb.com]
We now have multiple small virtual servers (slices/VPS)
Often located in different data centers
...and sometimes with different providers
We sometimes even still have physical, dedicated servers

The challenge
ORDER SERVER
SETUP SERVER: create users and groups, customize config files, install base packages
DEPLOY APPLICATION: install app-specific packages, deploy the application, start services
MAKE THIS PROCESS AS FAST (AND SIMPLE) AS POSSIBLE
Quickly integrate your new server into the existing architecture
...and make sure it's running!

Today's menu
FABRIC: Interact with your remote machines as if they were local
CUISINE: Takes care of users, groups, packages and configuration of your new machine
WATCHDOG: Ensures that your servers and services are up and running

Part 1
Fabric - http://fabfile.org
Application deployment & systems administration tasks

Fabric is a Python library and command-line tool for streamlining the use of SSH for application deployment or systems administration tasks.
Wait... what does that mean?

Streamlining SSH
By hand:
    version = os.popen("ssh myserver 'cat /proc/version'").read()
Using Fabric:
    from fabric.api import *
    env.hosts = ["myserver"]
    version = run("cat /proc/version")
You can specify multiple hosts and run the same commands across them
Connections will be lazily created and pooled
Failures ($STATUS) will be handled just like in Make
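
To make the "by hand" side concrete, here is a minimal standard-library sketch of running one command across several hosts. The names ssh_argv, run_on_hosts and the execute parameter are hypothetical and introduced only for illustration; this is not Fabric's API, and it starts one ssh process per host instead of pooling connections.

```python
import subprocess


def ssh_argv(host, command):
    """Build the argv list that runs `command` on `host` through ssh."""
    return ["ssh", host, command]


def run_on_hosts(hosts, command, execute=None):
    """Run `command` on every host and return {host: output}.

    `execute` takes an argv list and returns the output as text. It
    defaults to subprocess.check_output, and can be replaced by a fake
    when no SSH access is available.
    """
    if execute is None:
        execute = lambda argv: subprocess.check_output(argv, text=True)
    return {host: execute(ssh_argv(host, command)) for host in hosts}
```

Fabric's run() hides this plumbing and adds the connection handling and failure handling listed on the slide.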

Example: Installing packages
    sudo("aptitude install nginx")

    if run("dpkg -s %s | grep 'Status:' ; true" % package).find("installed") == -1:
        sudo("aptitude install '%s'" % (package))
It's easy to take action depending on the result
Note that we add true so that the run() always succeeds*
* there are other ways...
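
The check above looks for the substring "installed" anywhere in the output. As a sketch of a stricter test, the hypothetical helper below (is_installed is not part of Fabric or Cuisine) reads the Status: line that dpkg -s prints and only accepts the final word "installed", so a state such as "not-installed" or "config-files" is rejected.

```python
def is_installed(dpkg_status_output):
    """Return True if `dpkg -s <package>` output reports an installed package."""
    for line in dpkg_status_output.splitlines():
        if line.startswith("Status:"):
            # An installed package reports "Status: install ok installed".
            return line.split()[-1] == "installed"
    # No Status: line at all, e.g. the package is unknown to dpkg.
    return False
```

With Fabric, the text returned by run("dpkg -s nginx ; true") could be passed to such a function before deciding whether to call sudo().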

Example: retrieving system status
    disk_usage = run("df -kP")
    mem_usage = run("cat /proc/meminfo")
    cpu_usage = run("cat /proc/stat")

    print disk_usage, mem_usage, cpu_usage
Very useful for getting live information from many different servers
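
The raw text returned by run() usually needs a little parsing before it is useful. The sketch below turns df -kP output into a dictionary of used fractions per mount point. parse_df is a hypothetical helper, and it assumes filesystem names without spaces.

```python
def parse_df(output):
    """Parse `df -kP` output into {mount_point: used_fraction}."""
    usage = {}
    for line in output.strip().splitlines()[1:]:  # skip the header row
        # Columns: filesystem, 1024-blocks, used, available, capacity, mount
        fields = line.split(None, 5)
        if len(fields) < 6:
            continue
        capacity, mount = fields[4], fields[5]
        usage[mount] = int(capacity.rstrip("%")) / 100.0
    return usage
```

Collecting parse_df(run("df -kP")) from each host gives one comparable number per mount point and per server.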

Fabfile.py
    from fabric.api import *
    from mysetup import *

    env.hosts = ["server1.myapp.com"]

    def setup():
        install_packages("...")
        update_configuration()
        create_users()
        start_daemons()

    $ fab setup
Just like Make, you write rules that do something
...and you can specify on which servers the rules will run

Multiple hosts
    env.hosts = [
        "db1.myapp.com",
        "db2.myapp.com",
        "db3.myapp.com"
    ]

    @hosts("db1.myapp")
    def backup_db():
        run(...)

Roles
    env.roledefs = {
        'web': ['www1', 'www2', 'www3'],
        'dns': ['ns1', 'ns2']
    }

    $ fab -R web setup
Will run the setup rule only on hosts that are members of the web role.
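
Conceptually, a role is just a named list of hosts. The sketch below shows the lookup that a role selection implies; hosts_for_roles is a hypothetical helper written for illustration, not Fabric's implementation.

```python
def hosts_for_roles(roledefs, roles):
    """Return the hosts of the given roles, in order and without duplicates."""
    hosts = []
    for role in roles:
        for host in roledefs[role]:
            if host not in hosts:
                hosts.append(host)
    return hosts


roledefs = {
    "web": ["www1", "www2", "www3"],
    "dns": ["ns1", "ns2"],
}
web_hosts = hosts_for_roles(roledefs, ["web"])
```

Selecting the web role therefore targets www1, www2 and www3 and leaves the dns hosts alone.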

Some facts about Fabric
Fabric 1.0 just released! On March 4th, 2011
3 years of development: first commit 1161 days ago (on March 10th, 2011)
Related projects: Opscode's Chef, and Puppet

What's good about Fabric?
Low-level: basically an ssh() command that returns the result
Simple primitives: run(), sudo(), get(), put(), local(), prompt(), reboot()
No magic: no DSL, no abstraction, just a remote command API

What could be improved?
Ease common admin tasks: user and group creation, file and directory operations
Abstract primitives: like "install package", so that it works with different OSes
Templates: to make creating/updating configuration files easy

Cuisine: Chef-like functionality for Fabric

Part 2
Cuisine

What is Opscode's Chef?
Recipes: scripts/packages to install and configure services and applications
API: a DSL-like Ruby API to interact with the OS (create users, groups, install packages, etc.)
Architecture: client-server or "solo" mode to push and deploy your new configurations
http://wiki.opscode.com/display/chef/Home

What I liked about Chef
Flexible: you can use the API or shell commands
Structured: helped me have a clear decomposition of the services installed per machine
Community: lots of recipes already available from http://cookbooks.opscode.com/

What I didn't like
Too many files and directories: code is spread out, hard to get the big picture
Abstraction overload: API not very well documented, frequent fallbacks to plain shell scripts within the recipe
No "smart" recipe: recipes are applied all the time, even when it's not necessary

The question that kept coming...
Django recipe: 5 files, 2 directories
What it does, in essence:
    sudo aptitude install apache2 python python-django
Is this really necessary for what I want to do?

What I loved about Fabric
Bare metal: ssh() function, simple and elegant set of primitives
No magic: no abstraction, no model, no compilation
Two-way communication: easy to change the rule's behaviour according to the output (ex: do not install something that's already installed)

What I needed
Fabric, plus:
File I/O
User/Group Management
Package Management
Text processing & Templates

How I wanted it
Simple "flat" API: [object]_[operation], where operation is one of "create", "read", "update", "write", "remove", "ensure", etc.
Driven by need: only implement a feature if I have a real need for it
No magic: everything is implemented using sh-compatible commands
No unnecessary structure: everything fits in one file, no imposed file layout

Cuisine: Example fabfile.py
    from cuisine import *

    env.hosts = ["server1.myapp.com"]

    def setup():
        package_ensure("python", "apache2", "python-django")
        user_ensure("admin", uid=2000)
        upstart_ensure("django")

    $ fab setup
Fabric's core functions are already imported
package_ensure(), user_ensure() and upstart_ensure() are Cuisine's API calls

File I/O

Cuisine: File I/O
● file_exists: does the remote file exist?
● file_read: reads the remote file
● file_write: writes data to the remote file
● file_append: appends data to the remote file
● file_attribs: chmod & chown
● file_remove
Supports owner/group and mode change

Cuisine: File I/O (directories)
● dir_exists: does the remote directory exist?
● dir_ensure: ensures that a directory exists
● dir_attribs: chmod & chown
● dir_remove

Cuisine: File I/O +
● file_update(location, updater=lambda _:_)

    package_ensure("mongodb-snapshot")

    def update_configuration(text):
        res = []
        for line in text.split("\n"):
            if line.strip().startswith("dbpath="):
                res.append("dbpath=/data/mongodb")
            elif line.strip().startswith("logpath="):
                res.append("logpath=/data/logs/mongodb.log")
            else:
                res.append(line)
        return "\n".join(res)

    file_update("/etc/mongodb.conf", update_configuration)
This replaces the values for the configuration entries dbpath and logpath
The remote file will only be changed if the content is different
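
The "only changed if the content is different" behaviour can be sketched on a local file. update_file and set_dbpath below are hypothetical stand-ins written for illustration; Cuisine's file_update works on the remote file and its internals may differ.

```python
def update_file(path, updater):
    """Apply `updater` to the file's text; rewrite only if the text changed.

    Returns True when the file was rewritten, False when it was left alone.
    """
    with open(path) as f:
        original = f.read()
    updated = updater(original)
    if updated == original:
        return False
    with open(path, "w") as f:
        f.write(updated)
    return True


def set_dbpath(text):
    """Point any dbpath= entry at /data/mongodb, leaving other lines alone."""
    return "\n".join(
        "dbpath=/data/mongodb" if line.strip().startswith("dbpath=") else line
        for line in text.split("\n")
    )
```

Running the same update twice is harmless: the second call finds nothing to change and does not touch the file.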

User Management

Cuisine: User Management
● user_exists: does the user exist?
● user_create: creates the user
● user_ensure: creates the user if it doesn't exist

Cuisine: Group Management
● group_exists: does the group exist?
● group_create: creates the group
● group_ensure: creates the group if it doesn't exist
● group_user_exists: does the user belong to the group?
● group_user_add: adds the user to the group
● group_user_ensure

Package Management

Cuisine: Package Management
● package_exists: is the package available?
● package_installed: is it installed?
● package_install: installs the package
● package_ensure: ...only if it's not installed
● package_upgrade: upgrades the/all package(s)

Text & Templates

Cuisine: Text transformation
text_ensure_line(text, lines)

    file_update("/home/user/.profile", lambda _: text_ensure_line(_,
        "PYTHONPATH=/opt/lib/python:${PYTHONPATH};"
        "export PYTHONPATH"
    ))
Ensures that the PYTHONPATH variable is set and exported. If not, these lines will be appended.
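
The idea behind an "ensure line" helper fits in a few lines of plain Python. ensure_lines below is a hypothetical re-implementation for illustration, not Cuisine's code; it assumes exact line matches and always ends the text with a newline.

```python
def ensure_lines(text, *lines):
    """Append each of `lines` to `text` unless an identical line is present."""
    result = text.splitlines()
    for line in lines:
        if line not in result:
            result.append(line)
    return "\n".join(result) + "\n"
```

Because lines that are already present are skipped, applying the function repeatedly yields the same text, which is what makes it safe inside a rule that runs on every deployment.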

Cuisine: Text transformation
text_replace_line(text, old, new, find=.., process=...)

    configuration = local_read("server.conf")
    for key, value in variables.items():
        configuration, replaced = text_replace_line(
            configuration,
            key + "=",
            key + "=" + repr(value),
            process=lambda text: text.split("=")[0].strip()
        )
Replaces lines that look like VARIABLE=VALUE with the actual values from the variables dictionary.
The process lambda transforms input lines before comparing them. Here the lines are stripped of spaces and of their value.
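
To show what the process step buys you, here is a small sketch of the same idea. replace_line and key_of are hypothetical names, and this simplified version compares processed lines for equality only, without the find parameter.

```python
def replace_line(text, old, new, process=lambda line: line):
    """Replace every line that equals `old` once both went through `process`.

    Returns (new_text, number_of_replacements).
    """
    target = process(old)
    out, count = [], 0
    for line in text.split("\n"):
        if process(line) == target:
            out.append(new)
            count += 1
        else:
            out.append(line)
    return "\n".join(out), count


def key_of(line):
    """Reduce 'KEY = VALUE' to 'KEY' so that lines are compared by key only."""
    return line.split("=")[0].strip()
```

With key_of as the process function, "port = 80" and "port=" compare equal, so the line is replaced whatever its current value or spacing.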

Cuisine: Text transformation
text_strip_margin(text)

    file_write(".profile", text_strip_margin("""
        |export PATH="$HOME/bin":$PATH
        |set -o vi
    """))
Everything after the | separator will be output as content.
It makes it easy to embed text templates within functions.
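
A margin-stripping helper can be sketched as follows. strip_margin is a hypothetical re-implementation for illustration; in this version, lines that contain no separator are dropped.

```python
def strip_margin(text, margin="|"):
    """Keep only what follows the first `margin` character on each line."""
    lines = []
    for line in text.split("\n"):
        _, found, content = line.partition(margin)
        if found:
            lines.append(content)
    return "\n".join(lines)
```

The indentation before the separator is discarded, so the template can be indented along with the surrounding Python code.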

Cuisine: Text transformation
text_template(text, variables)

    text_template(text_strip_margin("""
        |cd ${DAEMON_PATH}
        |exec ${DAEMON_EXEC_PATH}
    """), dict(
        DAEMON_PATH="/opt/mongodb",
        DAEMON_EXEC_PATH="/opt/mongodb/mongod"
    ))
This is a simple wrapper around Python's (safe) string.Template substitution
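
For reference, this is what safe substitution looks like with the standard library alone. The template text reuses the slide's variables; the ${ARGS} placeholder is made up here to show what happens with an unknown name.

```python
from string import Template

script = Template("cd ${DAEMON_PATH}\nexec ${DAEMON_EXEC_PATH}")
rendered = script.safe_substitute(
    DAEMON_PATH="/opt/mongodb",
    DAEMON_EXEC_PATH="/opt/mongodb/mongod",
)

# safe_substitute leaves unknown placeholders untouched instead of
# raising KeyError, which is what makes it "safe".
partial = Template("exec ${DAEMON_EXEC_PATH} ${ARGS}").safe_substitute(
    DAEMON_EXEC_PATH="/opt/mongodb/mongod",
)
```

Leaving unknown placeholders in place is convenient for shell snippets, where ${...} may also be a genuine shell variable.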

Cuisine: Goodies
● ssh_keygen: generates DSA keys
● ssh_authorize: authorizes your key on the remote server
● mode_sudo: run() always uses sudo
● upstart_ensure: ensures the given daemon is running
& more!

Cuisine Tips: Structuring your rules
BOOTSTRAP, SETUP, UPDATE

BOOTSTRAP
You just received your new VPS, and you want to set it up so that you have a base system that you can access without typing a password
    def bootstrap():
        # Secure SSH, create admin user
        # Authorize SSH public keys
        # Remove unwanted packages

SETUP
You install your users, groups, preferred packages and configuration. You also install your applications.
    def setup():
        # Create directories (ex: /opt/data, /opt/services, etc)
        # Create user/groups (ex: apps, services, etc)
        # Install base tools (ex: screen, fail2ban, zsh, etc)
        # Edit configuration (ex: profile, inputrc, etc)
        # Install and run your application

UPDATE
You want to deploy the new version of the application you just built
    def update():
        # Download your application update
        # Freeze/stop the running application
        # Install the update
        # Reload/restart your application
        # Test that everything is OK

Why use Cuisine?
● Simple API for remote-server manipulation: files, users, groups, packages
● Shell commands for specific tasks only: avoid problems with your shell commands by only using run() for very specific tasks
● Cuisine tasks are not stupid: *_ensure() commands won't do anything if it's not necessary

Limitations
● Limited to sh-shells: operations will not work under csh
● Only written/tested for Ubuntu Linux: contributors could easily port commands

Get started!
On GitHub: http://github.com/sebastien/cuisine
1 short Python file
Documented API

Part 3
Watchdog
Server and services monitoring

The problem
Low disk space: archive files, rotate logs, purge cache
HTTP server has high latency: restart HTTP server
System load is too high: re-nice important processes

We want to be notified when problems occur
We want automatic actions to be taken whenever possible

(Some of the) existing solutions
Monit, God, Supervisord, Upstart: focus on starting/restarting daemons and services
Munin, Cacti: focus on visualization of RRDTool data
Collectd: focus on collecting and publishing data

The ideal tool
Wide spectrum: data collection, service monitoring, actions
Easy setup and deployment: no complex installation or configuration
Flexible server architecture: can monitor local or remote processes
Customizable and extensible: from restarting daemons to monitoring whole servers

Hello, Watchdog!
SERVICE: a service is a collection of RULES
RULE: HTTP request; CPU, disk, mem %; process status; I/O bandwidth
Each rule retrieves data and processes it. Rules can SUCCEED or FAIL
ACTION: logging; XMPP and email notifications; start/stop process; ...
Actions are bound to a rule and triggered on rule SUCCESS or FAILURE

Execution Model
MONITOR, SERVICE DEFINITION, RULE (frequency in ms), ACTION
Services are registered in the monitor
Rules defined in the service are executed every N ms (frequency)
If the rule SUCCEEDS, its success actions will be sequentially executed
If the rule FAILS, its failure actions will be sequentially executed
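
The execution model can be sketched in a few lines of plain Python. The Rule class and run_service function below are hypothetical and only mirror the description above; they are not Watchdog's classes, and the timing between runs is left out.

```python
class Rule:
    """A check plus the actions to run on success or on failure."""

    def __init__(self, check, success=(), fail=()):
        self.check = check            # callable returning True or False
        self.success = list(success)  # actions run when the check passes
        self.fail = list(fail)        # actions run when the check fails

    def run(self):
        ok = bool(self.check())
        # Actions run sequentially, in the order they were declared.
        for action in (self.success if ok else self.fail):
            action()
        return ok


def run_service(rules, iterations):
    """Run every rule `iterations` times and return the results per pass.

    A real monitor would wait for each rule's frequency between passes.
    """
    return [[rule.run() for rule in rules] for _ in range(iterations)]
```

Keeping actions as plain callables is one simple way to let the same notification be bound to several rules.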

Monitoring a remote machine
    #!/usr/bin/env python
    from watchdog import *
    Monitor(
        Service(
            name = "google-search-latency",
            monitor = (
                HTTP(
                    GET="http://www.google.ca/search?q=watchdog",
                    freq=Time.s(1),
                    timeout=Time.ms(80),
                    fail=[
                        Print("Google search query took more than 50ms")
                    ]
                )
            )
        )
    ).run()
A monitor is like the "main" for Watchdog. It actively monitors services.
Don't forget to call run() on it
The service monitors the rules
The HTTP rule lets you test a URL
And we display a message in case of failure
If there is a 4XX response or the request times out, the rule will fail and display an error message

Monitoring a remote machine
    $ python example-service-monitoring.py
    2011-02-27T22:33:18 watchdog --- #0 (runners=1,threads=2,duration=0.57s)
    2011-02-27T22:33:18 watchdog [!] Failure on HTTP(GET="www.google.ca:80/search?q=watchdog",timeout=0.08) : Socket error: timed out
    Google search query took more than 50ms
    2011-02-27T22:33:19 watchdog --- #1 (runners=1,threads=2,duration=0.73s)
    2011-02-27T22:33:20 watchdog --- #2 (runners=1,threads=2,duration=0.54s)
    2011-02-27T22:33:21 watchdog --- #3 (runners=1,threads=2,duration=0.69s)
    2011-02-27T22:33:22 watchdog --- #4 (runners=1,threads=2,duration=0.77s)
    2011-02-27T22:33:23 watchdog --- #5 (runners=1,threads=2,duration=0.70s)

Sending Email Notification
    send_email = Email(
        "[email protected]",
        "[Watchdog]Google Search Latency Error",
        "Latency was over 80ms",
        "smtp.gmail.com", "myusername", "mypassword"
    )

    [...]
    HTTP(
        GET="http://www.google.ca/search?q=watchdog",
        freq=Time.s(1),
        timeout=Time.ms(80),
        fail=[
            send_email
        ]
    )
The Email action will send an email to [email protected] when triggered
Listing it under fail is how we bind the action to the rule failure

Sending Email+Jabber Notification
    send_xmpp = XMPP(
        "[email protected]",
        "Watchdog: Google search latency over 80ms",
        "[email protected]", "mypassword"
    )

    [...]
    HTTP(
        GET="http://www.google.ca/search?q=watchdog",
        freq=Time.s(1),
        timeout=Time.ms(80),
        fail=[
            send_email, send_xmpp
        ]
    )

Monitoring incident: when something fails repeatedly during a given period of time
You don't want to be notified all the time, only when it really matters.

Detecting incidents
    HTTP(
        GET="http://www.google.ca/search?q=watchdog",
        freq=Time.s(1),
        timeout=Time.ms(80),
        fail=[
            Incident(
                errors = 5,
                during = Time.s(10),
                actions = [send_email, send_xmpp]
            )
        ]
    )
An incident is a "smart" action: it will only do something when the condition is met
When at least 5 errors happen over a 10-second period, the Incident action will trigger the given actions
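
The "N errors within a time window" condition can be sketched with a queue of failure timestamps. The Incident class below is a hypothetical stand-in, not Watchdog's implementation; in particular, clearing the history after triggering is an assumption made here.

```python
from collections import deque


class Incident:
    """Trigger `actions` once `errors` failures occur within `during` seconds."""

    def __init__(self, errors, during, actions):
        self.errors = errors
        self.during = during
        self.actions = actions
        self.failures = deque()  # timestamps of recent failures

    def record_failure(self, now):
        """Record a failure at time `now` (seconds); return True if triggered."""
        self.failures.append(now)
        # Forget failures that fell out of the time window.
        while self.failures and now - self.failures[0] > self.during:
            self.failures.popleft()
        if len(self.failures) >= self.errors:
            for action in self.actions:
                action()
            self.failures.clear()  # assumption: start counting afresh
            return True
        return False
```

Passing the timestamp in explicitly keeps the sketch easy to test; a monitor would pass the current time on each rule failure.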

ffunctioninc.

Example: Ensuring a service is running

from watchdog import *Monitor(

Service(name="myservice-ensure-up",monitor=(

HTTP(GET="http://localhost:8000/",freq=Time.ms(500),fail=[

Incident(errors=5,during=Time.s(5),actions=[

Restart("myservice-start.py")])])))).run()

Every 500 ms, we test whether we can GET http://localhost:8000/.

If we can't reach it 5 times within 5 seconds, we kill and restart myservice-start.py.
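As a rough picture of what "kill and restart" involves, here is a minimal sketch built on the standard subprocess module. It is not Watchdog's Restart action: RestartableProcess is a hypothetical helper, and a sleeping child process stands in for myservice-start.py.

```python
import subprocess
import sys


class RestartableProcess:
    """Keeps a handle on a child process so it can be killed and started again."""

    def __init__(self, command):
        self.command = command
        self.process = None

    def start(self):
        self.process = subprocess.Popen(self.command)
        return self.process.pid

    def restart(self):
        # Kill the running instance (if any) before starting a fresh one
        if self.process is not None and self.process.poll() is None:
            self.process.kill()
            self.process.wait()
        return self.start()


# Hypothetical stand-in for myservice-start.py: a child that just sleeps
service = RestartableProcess([sys.executable, "-c", "import time; time.sleep(30)"])
first_pid = service.start()
second_pid = service.restart()

# Clean up the stand-in process
service.process.kill()
service.process.wait()
```

Waiting on the killed process before starting the new one avoids leaving a zombie behind and makes sure the old instance has released its port.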

Example: Monitoring system health

    from watchdog import *
    Monitor(
        Service(
            name="system-health",
            monitor=(
                SystemInfo(
                    freq=Time.s(1),
                    success=(
                        LogResult("myserver.system.mem", extract=lambda r, _: r["memoryUsage"]),
                        LogResult("myserver.system.disk", extract=lambda r, _: reduce(max, r["diskUsage"].values())),
                        LogResult("myserver.system.cpu", extract=lambda r, _: r["cpuUsage"]),
                    )
                ),
                Delta(
                    Bandwidth("eth0", freq=Time.s(1)),
                    extract=lambda v: v["total"]["bytes"] / 1000.0 / 1000.0,
                    success=[LogResult("myserver.system.eth0.sent")]
                ),
                SystemHealth(
                    cpu=0.90, disk=0.90, mem=0.90,
                    freq=Time.s(60),
                    fail=[Log(path="watchdog-system-failures.log")]
                ),
            )
        )
    ).run()

SystemInfo will retrieve system information and return it as a dictionary.

We log each result by extracting the given value from the result dictionary (memoryUsage, diskUsage, cpuUsage).
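To make the extraction step concrete, the sketch below applies lambdas of the same shape to a made-up result dictionary. The values, the meaning of the second lambda argument (ignored here) and the output format are all assumptions. Note that in Python 3 reduce has to be imported from functools.

```python
from functools import reduce  # reduce is no longer a builtin in Python 3

# Hypothetical result dictionary, using the keys named on the slide
result = {
    "memoryUsage": 0.42,
    "cpuUsage": 0.10,
    "diskUsage": {"/": 0.55, "/home": 0.80},
}

# Same shape as the extract= lambdas: (result, second argument) -> value
extractors = {
    "myserver.system.mem": lambda r, _: r["memoryUsage"],
    # diskUsage holds one value per mount point: keep the fullest one
    "myserver.system.disk": lambda r, _: reduce(max, r["diskUsage"].values()),
    "myserver.system.cpu": lambda r, _: r["cpuUsage"],
}

lines = [
    "%s %s" % (name, extract(result, None))
    for name, extract in sorted(extractors.items())
]
```

Here reduce(max, ...) picks the largest disk usage across mount points, so a single full partition is enough to show up in the log.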

Bandwidth collects live traffic information for a network interface.

But we don't want the total amount, we just want the difference. Delta does just that.
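The idea behind turning a running total into a difference can be sketched in a few lines. This is an illustration only: DeltaCounter is a hypothetical name, the sample values are invented, and returning None for the first sample is an assumption about how to handle the missing baseline.

```python
class DeltaCounter:
    """Turns a cumulative counter into per-sample differences."""

    def __init__(self):
        self.previous = None

    def update(self, total):
        # The first sample has nothing to compare against
        delta = None if self.previous is None else total - self.previous
        self.previous = total
        return delta


# Hypothetical cumulative byte counts, sampled once per second
samples = [1000000, 1500000, 1500000, 4000000]
counter = DeltaCounter()
# Same conversion as on the slide: bytes / 1000.0 / 1000.0
megabytes = [
    None if delta is None else delta / 1000.0 / 1000.0
    for delta in (counter.update(sample) for sample in samples)
]
```

With one sample per second, each difference reads directly as traffic per second.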

We log the result as before.

SystemHealth will fail whenever the usage is above the given thresholds.

We'll log failures in a log file.
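The threshold check and the failure log can be sketched as two small functions. This is a simplified stand-in rather than Watchdog's SystemHealth or Log: the function names, the usage dictionary layout (ratios between 0 and 1) and the log line format are assumptions.

```python
def exceeded_thresholds(usage, cpu=0.90, disk=0.90, mem=0.90):
    """Returns the sorted names of the metrics whose usage is above its threshold."""
    limits = {"cpu": cpu, "disk": disk, "mem": mem}
    return sorted(name for name, limit in limits.items() if usage[name] > limit)


def log_failures(path, failures):
    """Appends one line per failing metric to the log file at `path`."""
    with open(path, "a") as log:
        for name in failures:
            log.write("usage above threshold: %s\n" % name)


# Hypothetical usage ratios between 0 and 1
healthy = exceeded_thresholds({"cpu": 0.35, "disk": 0.60, "mem": 0.72})
overloaded = exceeded_thresholds({"cpu": 0.97, "disk": 0.60, "mem": 0.93})
```

A call such as log_failures("watchdog-system-failures.log", overloaded) would then append one line for cpu and one for mem.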

Watchdog: Decentralized architecture

Each server runs its own watchdog (W):

APP SERVER: ensures the app is running (pid & HTTP test)

STATIC FILE SERVER: ensures the static file server is running and has low latency

DB SERVER: ensures the DB is running and that queries are not too slow

Watchdog: Centralized Architecture

[Diagram: APP SERVER, STATIC FILE SERVER and DB SERVER, with a single watchdog (W) on a separate PLATFORM SERVER]

The watchdog does high-level (HTTP, SQL) queries on the servers and executes actions remotely when problems are detected.
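A centralized check loop might look like the sketch below, which only uses the standard library. It is not taken from Watchdog: http_ok and check_servers are hypothetical helpers, the server URLs in the usage note are placeholders, and the "remote action" is reduced to a plain callable (in practice it could wrap an SSH command).

```python
import urllib.request


def http_ok(url, timeout=1.0):
    """Returns True if a GET on `url` answers with a 2xx status within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return False


def check_servers(servers, probe=http_ok):
    """Probes every server and runs the action of each one that fails.

    `servers` maps a name to a (url, action) pair; returns the failed names.
    """
    failed = []
    for name, (url, action) in sorted(servers.items()):
        if not probe(url):
            action(name)
            failed.append(name)
    return failed
```

For example, check_servers({"app": ("http://app.example/", restart_remotely)}) would call the hypothetical restart_remotely("app") if the request fails. Passing the probe as a parameter keeps the loop testable without a network.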

Watchdog: Deploying on Ubuntu

UPSTART!

    # upstart - Watchdog Configuration File
    # =====================================
    # updated: 2011-02-28

    description "Watchdog - service monitoring daemon"
    author "Sebastien Pierre <[email protected]>"

    start on (net-device-up and local-filesystems)
    stop on runlevel [016]

    respawn

    script
        # NOTE: Change this to wherever the watchdog is installed
        WATCHDOG_HOME=/opt/services/watchdog
        cd $WATCHDOG_HOME
        # NOTE: Change this to wherever your custom watchdog script is installed
        python watchdog.py
    end script

    console output
    # EOF

Save this file as /etc/init/watchdog.conf

Watchdog: Overview

Monitoring DSL: declarative programming to define monitoring strategy

Wide spectrum: from data collection to incident detection

Flexible: does not impose a specific architecture

Watchdog: Use cases

Ensure service availability: test and stop/restart when problems occur

Collect system statistics: log or send data through the network

Alert on system or service health: take action when system stats are above a threshold

Watchdog: What's coming?

ZeroMQ channels: data streaming and inter-watchdog communication

Documentation: only the basics, needs more love!

Contributors? The codebase is small and clear, start hacking!

Get started!

On GitHub: http://github.com/sebastien/watchdog

1 Python file, documented API

Thank you!

[email protected]/sebastien