HawkEye: A Monitoring and Management Tool for Distributed Systems


HawkEye: A Monitoring and Management Tool for Distributed Systems

Todd Tannenbaum
Department of Computer Sciences
University of Wisconsin-Madison
http://www.cs.wisc.edu/condor
[email protected]


What does Condor have?

› …lots of core technology for building a distributed system
› …lots of core technology for monitoring the status of a machine
› …lots of core technology for managing a workload of tasks
› …lots of really, truly skilled and experienced developers and researchers who build distributed systems. Some of the best. Standout state employees. Honest. Email for Wisconsin Gov Scott McCallum: [email protected]


One day an avid Condor user asked:

Say, could Condor technology be used for distributed system administration??


Time to think…

› Gathered up our experiences with our own management tasks, looked at the mature Condor technology available to us, and the HawkEye effort was born.
› Completely separate from Condor from the end user's perspective. Can install HawkEye, or Condor, or both.


First Component: MONITORING

› Sysadmins first need information about what is happening on the machines they are responsible for.
  - Both current and past
  - Information must be consolidated and easily accessible
  - Information must be dynamic


Condor ClassAds

› Technology for an entity to describe itself
› Simple attribute-value pairs

[
    load_average = 1.3
    free_Swap_space_mb = 140
    number_of_processes = 92
    keyboard_idle_secs = 6
    ram = 128
    total_swap = 512
    total_memory = ram + total_swap
    busy = load_average > 1.0
]
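To make the idea of attributes that hold expressions concrete, here is a minimal Python sketch that evaluates the example ad above. It is a toy, not the Condor ClassAd library: the names `AD`, `evaluate`, and `_eval_node` are made up for illustration, only arithmetic and simple comparisons are handled, and circular references are not detected.

```python
import ast
import operator

# Operators understood by this toy evaluator; the real ClassAd language has more.
BINARY_OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
              ast.Mult: operator.mul, ast.Div: operator.truediv}
COMPARE_OPS = {ast.Gt: operator.gt, ast.GtE: operator.ge,
               ast.Lt: operator.lt, ast.LtE: operator.le}

# The example ad from the slide, with every right-hand side kept as text.
AD = {
    "load_average": "1.3",
    "free_Swap_space_mb": "140",
    "number_of_processes": "92",
    "keyboard_idle_secs": "6",
    "ram": "128",
    "total_swap": "512",
    "total_memory": "ram + total_swap",
    "busy": "load_average > 1.0",
}


def evaluate(ad, name):
    """Evaluate one attribute, following references to other attributes."""
    tree = ast.parse(ad[name], mode="eval")
    return _eval_node(tree.body, ad)


def _eval_node(node, ad):
    if isinstance(node, ast.Constant):      # a literal such as 1.3 or 128
        return node.value
    if isinstance(node, ast.Name):          # a reference to another attribute
        return evaluate(ad, node.id)
    if isinstance(node, ast.BinOp) and type(node.op) in BINARY_OPS:
        left = _eval_node(node.left, ad)
        right = _eval_node(node.right, ad)
        return BINARY_OPS[type(node.op)](left, right)
    if (isinstance(node, ast.Compare) and len(node.ops) == 1
            and type(node.ops[0]) in COMPARE_OPS):
        left = _eval_node(node.left, ad)
        right = _eval_node(node.comparators[0], ad)
        return COMPARE_OPS[type(node.ops[0])](left, right)
    raise ValueError("unsupported expression: " + ast.dump(node))


print(evaluate(AD, "total_memory"))  # 640
print(evaluate(AD, "busy"))          # True
```

Keeping every value as an unevaluated expression means `total_memory` and `busy` are recomputed from whatever `ram`, `total_swap`, and `load_average` currently hold.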


Condor ClassAds, cont.

› No fixed schema
› Attributes can contain values or expressions
› Serialize Ads in XML
› Open source libraries in C++ and Java to:
  - Manipulate Ads and Ad attributes
  - Store Ads
  - Query collections of Ads
› Bindings for Perl and others on the way…
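As a rough illustration of what serializing an ad to XML could look like, the Python sketch below writes attribute names and expression text into XML and reads them back. The slides do not show the actual ClassAd XML layout, so the element names `classad` and `attr` and the functions `ad_to_xml` and `xml_to_ad` are assumptions made up for this example.

```python
import xml.etree.ElementTree as ET


def ad_to_xml(ad):
    """Serialize a dict of attribute name -> expression text to an XML string."""
    root = ET.Element("classad")                       # hypothetical element name
    for name, expr in ad.items():
        attr = ET.SubElement(root, "attr", name=name)  # hypothetical element name
        attr.text = expr
    return ET.tostring(root, encoding="unicode")


def xml_to_ad(text):
    """Parse the XML produced by ad_to_xml back into a dict."""
    root = ET.fromstring(text)
    return {attr.get("name"): attr.text for attr in root.findall("attr")}


ad = {"load_average": "1.3", "busy": "load_average > 1.0"}
xml_text = ad_to_xml(ad)
assert xml_to_ad(xml_text) == ad  # round trip preserves names and expressions
```

Because there is no fixed schema, the serializer does not need to know the attribute names in advance; it simply walks whatever the ad contains.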


[Diagram: two HawkEye Monitoring Agents and a HawkEye Manager, connected by an arrow labeled "ClassAd Updates Via Secure UDP"]


[Diagram: one HawkEye Manager and five HawkEye Monitoring Agents]


HawkEye Monitoring Agent

[Diagram: a HawkEye Monitoring Agent with parts labeled "/proc, kstat…", "Hawkeye_Startup_Agent", and "Hawkeye_Monitor"; the agent is connected to the HawkEye Manager by an arrow labeled "ClassAd Updates Via Secure UDP"]


Monitor Agent, cont.

› Updates are sent periodically
  - Information does not get stale
› Updates also serve as a heartbeat monitor
  - Know when a machine is down
› Out of the box, the update ClassAd has many attributes about the machine of interest for system administration
  - Current prototype = 184 attributes
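The heartbeat idea can be sketched in a few lines of Python: if the manager remembers when each machine last sent an update, any machine whose last update is older than a few update intervals can be presumed down. The function name, the `missed_allowed` parameter, and the numbers below are illustrative assumptions; the slides do not state HawkEye's actual interval or timeout policy.

```python
def machines_presumed_down(last_update, now, update_interval, missed_allowed=3):
    """Return machines that have missed more than `missed_allowed` updates.

    last_update maps machine name -> timestamp (seconds) of its latest ad.
    """
    deadline = update_interval * missed_allowed
    return sorted(name for name, seen in last_update.items()
                  if now - seen > deadline)


# Hypothetical data: node-b last reported 70 seconds ago, with updates every 10 s.
last_update = {"node-a": 100.0, "node-b": 40.0}
print(machines_presumed_down(last_update, now=110.0, update_interval=10.0))
# ['node-b']
```

Allowing a few missed updates before declaring a machine down is a common way to tolerate the occasional lost UDP packet.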


What if I want to monitor something you didn’t think about?


Custom Attributes

[Diagram: a HawkEye Monitoring Agent (parts labeled "/proc, kstat…", "Hawkeye_Startup_Agent", "Hawkeye_Monitor") and the HawkEye Manager, with an added input labeled "Data from hawkeye_update_attribute command line tool"]

› Create your own HawkEye plugins, or share plugins with others
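A plugin of this kind boils down to a small program that measures something and reports it as attribute-value pairs. The Python sketch below shows only that half: it gathers one custom value and formats it as ClassAd-style lines. The attribute names and both function names are made up for illustration, and the hand-off to `hawkeye_update_attribute` is deliberately left out because the slides do not show that tool's arguments.

```python
import shutil


def collect_custom_attributes(path="."):
    """Gather a couple of site-specific values (hypothetical attribute names)."""
    usage = shutil.disk_usage(path)
    return {
        "free_disk_mb": usage.free // (1024 * 1024),
        "monitored_path": path,
    }


def format_attributes(attrs):
    """Render attributes as 'name = value' lines; strings are double-quoted.

    Embedded quotes in string values are not escaped in this sketch.
    """
    lines = []
    for name, value in attrs.items():
        if isinstance(value, str):
            value = '"' + value + '"'
        lines.append(f"{name} = {value}")
    return lines


for line in format_attributes(collect_custom_attributes()):
    print(line)
```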


Role of HawkEye Manager

› Store all incoming ClassAds in an indexed resident data structure
  - Fast response to client tool queries about current state
  - “Show me all machines with a load average > 10”
› Periodically store ClassAd attributes into a Round Robin Database
  - Store information over time
  - “Show me a graph with the load average for this machine over the past week”
› Speak to clients via CEDAR, HTTP
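The two storage roles can be sketched together in Python: a dictionary keyed by machine name holds the latest ad for current-state queries, and a fixed-size ring buffer per attribute keeps recent history. This is only an in-memory stand-in under stated assumptions: `ManagerStore` and its methods are hypothetical names, a `deque` with `maxlen` imitates the overwrite-oldest behavior of a round robin database without being one, and queries here scan every ad rather than using a real index.

```python
from collections import deque


class ManagerStore:
    """Toy stand-in for the manager's current-state table and history store."""

    def __init__(self, history_size=4):
        self.current = {}   # machine name -> latest ad (a plain dict)
        self.history = {}   # (machine, attribute) -> ring buffer of recent values
        self.history_size = history_size

    def update(self, machine, ad):
        """Record an incoming ad and append its values to the history buffers."""
        self.current[machine] = ad
        for name, value in ad.items():
            ring = self.history.setdefault(
                (machine, name), deque(maxlen=self.history_size))
            ring.append(value)  # the oldest value drops off once the buffer is full

    def query(self, predicate):
        """Return names of machines whose latest ad satisfies the predicate."""
        return sorted(m for m, ad in self.current.items() if predicate(ad))


store = ManagerStore(history_size=2)
store.update("node1", {"load_average": 12.5})
store.update("node2", {"load_average": 0.4})
print(store.query(lambda ad: ad["load_average"] > 10))   # ['node1']

store.update("node1", {"load_average": 11.0})
store.update("node1", {"load_average": 3.0})
print(list(store.history[("node1", "load_average")]))    # [11.0, 3.0]
```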


Several different clients

› Command-line, GUI, Web-based


But sysadmins also sometimes have to do work…

› Task: copy a new library onto the local disk of each machine.
  - Just a script to copy via rcp/scp to every machine… or is it?


Running tasks on behalf of the sysadmin

› Submit your sysadmin tasks to HawkEye
  - Tasks are stored in a persistent queue by the Manager
  - Tasks can leave the queue upon completion, or repeat after specified intervals
  - Tasks can have complex interdependencies via DAGMan
  - Records are kept on which task ran where
› Sounds like Condor, eh? Yes, but simpler…
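The queue behavior described above (run once and leave, or repeat after an interval, with a record of what ran where) can be sketched in Python as follows. This is an in-memory illustration only: `Task`, `TaskQueue`, and the task names are hypothetical, nothing is persisted to disk, and DAGMan-style interdependencies are not modeled.

```python
import heapq
import itertools
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(order=True)
class Task:
    due: float                                   # when the task should next run
    seq: int                                     # tie-breaker for equal due times
    name: str = field(compare=False)
    action: Callable[[], None] = field(compare=False)
    interval: Optional[float] = field(default=None, compare=False)


class TaskQueue:
    def __init__(self):
        self._heap = []
        self._seq = itertools.count()
        self.audit_log = []                      # (time, machine, task name) records

    def submit(self, name, action, due=0.0, interval=None):
        """Queue a task; interval=None means run once, otherwise repeat."""
        if interval is not None and interval <= 0:
            raise ValueError("interval must be positive")
        heapq.heappush(self._heap,
                       Task(due, next(self._seq), name, action, interval))

    def run_due(self, now, machine):
        """Run every task that is due, log it, and requeue repeating tasks."""
        while self._heap and self._heap[0].due <= now:
            task = heapq.heappop(self._heap)
            task.action()
            self.audit_log.append((now, machine, task.name))
            if task.interval is not None:
                self.submit(task.name, task.action,
                            due=now + task.interval, interval=task.interval)


queue = TaskQueue()
queue.submit("copy-library", lambda: None)               # once-only
queue.submit("check-disk", lambda: None, interval=60.0)  # repeats every 60 s
queue.run_due(now=0.0, machine="node1")
queue.run_due(now=60.0, machine="node1")
print(queue.audit_log)
# [(0.0, 'node1', 'copy-library'), (0.0, 'node1', 'check-disk'), (60.0, 'node1', 'check-disk')]
```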


Run Tasks in response to monitoring information

› ClassAd “Requirements” attribute
› Example: Send email if a machine is low on disk space or low on swap space
  - Submit an email task with an attribute:
    Requirements = free_disk < 5 || free_swap < 5
› Example w/ task interdependency: If load average is high and OS=Linux and console is idle, submit a task which runs “top”; if top sees Netscape, submit a task to kill Netscape
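Both examples can be mimicked in Python by treating each Requirements expression as a predicate over a machine's ad. The sketch below is an illustration under assumptions, not HawkEye's matching code: the attribute names `os` and `console_idle`, the task names, and the `high_load` threshold are made up, and the output of top is passed in as a string rather than obtained by running anything.

```python
def low_on_space(ad):
    # Mirrors: Requirements = free_disk < 5 || free_swap < 5
    return ad["free_disk"] < 5 or ad["free_swap"] < 5


def should_run_top(ad, high_load=5.0):
    # "Load average is high" needs a threshold; 5.0 is an arbitrary assumption.
    return (ad["load_average"] > high_load
            and ad["os"] == "Linux"
            and ad["console_idle"])


def tasks_for(ad):
    """First-stage tasks triggered directly by the monitoring data."""
    tasks = []
    if low_on_space(ad):
        tasks.append("send_email")
    if should_run_top(ad):
        tasks.append("run_top")
    return tasks


def follow_up_tasks(top_output):
    """Second-stage task that depends on what the earlier top task reported."""
    return ["kill_netscape"] if "netscape" in top_output.lower() else []


ad = {"free_disk": 3, "free_swap": 200, "load_average": 9.1,
      "os": "Linux", "console_idle": True}
print(tasks_for(ad))                                 # ['send_email', 'run_top']
print(follow_up_tasks("  1234 user  98.0 netscape"))  # ['kill_netscape']
```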


HawkEye Design Goals

› Monitoring
  - Reliable presence
  - Get data off the node in an extensible, consistent manner
› Run Tasks
  - In response to probe information
  - Repeat or once-only semantics
  - Audit log
› Independent and self-contained
› Cross-Platform


Current Status

› Just beginning this project
› Initial release early summer
› Prototypes already running – stop in and see initial HawkEye work
  - Rm 3385 on Weds 9am – 12pm


Thank you!

I was an overworked sysadmin. Now I have more free time thanks to HawkEye!