baking-in transparency

Baking-In Transparency

Saturday, October 8, 11

About Me

• Matt Simmons

• 11+ year System Administrator

• http://www.standalone-sysadmin.com

• @standaloneSA

• [email protected]



The Situation


Devs make things

• Small discrete programs

• Large complex programs

• Immense interconnected software suites


Ops makes things go

• Script using small discrete programs

• Administer large complex programs

• Cluster immense interconnected software suites


There is a direct relationship between the software that developers write and the software that gets implemented by operations.


The Problems


Why?

Software needs to be monitored

"When performance is measured, performance improves. When performance is measured and reported back, the rate of improvement accelerates."

--Pearson’s Law

“You can’t manage what you can’t measure” --Robert Kaplan


Software needs to be managed

“Management by objective works - if you know the objective. 90% of the time, you don’t.”

--Peter Drucker


Clearly we need to measure...

But what do we measure?

And what metrics do we use?

How do we obtain the measurements?


What do we measure?

Software Engineers measure...

• Programmer Productivity

• code size/efficiency

• Defect Density

• Bugs / module size

• Requirement Stability

• “feature creep”


What do we measure?

Operations measures...

• Resource Utilization

• Diskspace, Bandwidth, etc

• Infrastructure Stability

• Service Uptime, MTBF, etc

• Performance

• CPU / Memory efficiency, etc


What metrics do we use?

It depends.

Duh.


The metrics that Ops needs to monitor are not always easy to obtain...


...even though they’re really important

• Reliability

• Repeatability

• Root Cause Identification


...so not only is monitoring important...


Monitoring is hard.


Monitoring correctly is hard.


Why is monitoring hard?

• Monitoring Software Suites are complex

• Infrastructures are complex

• Processes and applications are opaque to our futile requests to determine and track internal state


Processes and applications are opaque to our futile requests to determine and track internal state


The Solution(s)


Dev/Ops working together gives

• Team Interrelationships

• Knowledge Sharing

• Cross Training

• Tool Sharing


But more specifically...

Methods of monitoring software can be

BUILT INTO THE SOFTWARE


How things are designed now

Question: A well-designed program encounters an error. What happens?

Answer: It handles the error and continues processing requests.


How things are designed now

Question: A poorly-designed program encounters an error. What happens?

Answer: It crashes and burns.


Question: Which of those is easier to monitor?


Obviously, dying to alert the monitoring system is overkill.

(pun firmly intended)


How do we make our statuses available to the monitoring system, then?

It depends on the kind of software


Remember these?

• Small discrete programs

• Large complex programs

• Immense interconnected software suites


Small Discrete Programs

• Possibly a utility

• Usually scripted or run manually

• Typically short-term run time


Small Discrete Programs: Monitoring

• Screen output

• Return codes

• Catch signals

• Great example: ping & SIGQUIT

• SIGUSR1 & SIGUSR2
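As a rough sketch of the return-code and signal ideas above, here is how a small utility might expose its status, written in Python. The counters, the state file path, and the function names are all made up for illustration; the only real pieces are the standard `signal`, `json`, `os`, and `tempfile` modules.

```python
import json
import os
import signal
import tempfile

# Hypothetical counters a small utility might keep while it runs.
STATE = {"processed": 0, "errors": 0}

# Hypothetical location; any path your monitoring system can read would do.
STATE_FILE = os.path.join(tempfile.gettempdir(), "example_state.json")


def dump_state(signum, frame):
    """Signal handler: write the current counters to STATE_FILE and keep running."""
    with open(STATE_FILE, "w") as fh:
        json.dump(STATE, fh)


def install_handler():
    """Ask the OS to call dump_state whenever the process receives SIGUSR1."""
    signal.signal(signal.SIGUSR1, dump_state)


def main(items):
    """Process items (a negative number stands in for a failure).

    Returns the exit code the program should finish with:
    0 if everything worked, 1 if anything failed.
    """
    for item in items:
        if item < 0:
            STATE["errors"] += 1
        else:
            STATE["processed"] += 1
    return 1 if STATE["errors"] else 0
```

A real script would call `install_handler()` at startup and finish with `sys.exit(main(...))`. Ops can then check `$?` after each run, or send `kill -USR1 <pid>` while it runs and read the state file.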


Signal Handling in Perl

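# drop_state_file() stands in for your own routine that writes current state to disk
sub USR1_handler {
    drop_state_file();
}

# Straight quotes only; a code reference is the usual way to install a handler
$SIG{USR1} = \&USR1_handler;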


Large Complex Programs

• Probably a daemon or interactive program

• Long running, needs to be stable

• Subject to resource change over time

• May need to retain state across restarts

• May have a web component


Large Complex Programs: Reporting

• No screen output (except debugging)

• Logging

• SNMP Agent/Traps

• (seriously, read ‘man snmpd.conf’)

• Named Pipes (FIFO)

• State Output to DB (if appropriate)
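To make the named pipe option above a little more concrete, here is a minimal Python sketch, assuming a Unix system. The function names, the `key=value` line format, and the sample state are assumptions for illustration, not anything prescribed by the slides. The daemon side blocks until something opens the FIFO for reading, then writes one status line.

```python
import os
import tempfile
import threading


def format_status(state):
    """Render state as one 'key=value' line, sorted so the output is stable."""
    return " ".join("%s=%s" % (key, state[key]) for key in sorted(state)) + "\n"


def serve_status_once(fifo_path, state):
    """Daemon side: block until a reader opens the FIFO, then write one line."""
    with open(fifo_path, "w") as fifo:
        fifo.write(format_status(state))


def read_status(fifo_path):
    """Monitoring side: open the FIFO and read whatever the daemon wrote."""
    with open(fifo_path) as fifo:
        return fifo.read()


def demo():
    """Create a FIFO in a temp directory and pass one status line through it."""
    directory = tempfile.mkdtemp()
    fifo_path = os.path.join(directory, "status.fifo")
    os.mkfifo(fifo_path)
    # Hypothetical state values for the example.
    state = {"queue_depth": 3, "uptime_s": 120}
    writer = threading.Thread(target=serve_status_once, args=(fifo_path, state))
    writer.start()
    try:
        return read_status(fifo_path)
    finally:
        writer.join()
        os.remove(fifo_path)
        os.rmdir(directory)
```

In a real daemon the writer would sit in a loop on its own thread, so a plain `cat` on the FIFO always gets a fresh status line.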


Net-SNMP Embedded Perl

perl use Data::Dumper;

perl sub myroutine { print "got called: ",Dumper(@_),"\n"; }

perl $agent->register('mylink', '.1.3.6.1.8765', \&myroutine);


Immense Interconnected Software Suites

(or Large Suites)


Large Suites

• Definitely retain state across restarts

• Probably requires centralized controller

• May use sockets to communicate

• Probably has a web component


Large Suites: Reporting

Everything under “Large Programs”, plus...

• Monitoring coordinated by the “central” node or program

• Aggregation of state

• Provide layer of abstraction from any in-suite monitoring or reporting

• Provide XML/CSV in addition to human-parsable HTML pages
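One way the aggregation and machine-readable output might look, sketched in Python. The component names, the status vocabulary, and the "worst component wins" rule are assumptions made for the example; a real suite would define its own.

```python
import csv
import io

# Hypothetical per-component reports, as a central node might collect them.
REPORTS = [
    {"component": "web", "status": "ok", "queue_depth": 0},
    {"component": "worker", "status": "degraded", "queue_depth": 42},
    {"component": "db", "status": "ok", "queue_depth": 3},
]

# Assumed ordering of statuses, from healthiest to least healthy.
SEVERITY = {"ok": 0, "degraded": 1, "down": 2}


def overall_status(reports):
    """Aggregate: the suite is reported as its least healthy component."""
    statuses = (report["status"] for report in reports)
    return max(statuses, key=SEVERITY.__getitem__, default="ok")


def to_csv(reports):
    """Render the reports as CSV so scripts do not have to scrape HTML."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=["component", "status", "queue_depth"],
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(reports)
    return buffer.getvalue()
```

The monitoring system only ever asks the central node, and the central node decides how to poll or collect from everything behind it.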


What we’re really doing is IPC

So what other methods exist? Lots.


Unix IPC

• Sockets

• RPC

• Message Queues

• FIFO

• Shared Memory

• And Many More...
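As one illustration from the list above, here is a small Python sketch of status reporting over a socket. The `STATUS` command, the JSON reply, and the sample state are invented for the example. `socket.socketpair()` stands in for a real listening socket so the sketch stays self-contained.

```python
import json
import socket


def handle_request(conn, state):
    """Application side: read one command and reply with one line of JSON."""
    command = conn.recv(64).decode("ascii", "replace").strip()
    if command == "STATUS":
        reply = json.dumps(state, sort_keys=True)
    else:
        reply = json.dumps({"error": "unknown command"})
    conn.sendall(reply.encode("ascii") + b"\n")


def demo():
    """Send STATUS from the 'monitor' end and return the application's reply."""
    app_side, monitor_side = socket.socketpair()
    try:
        monitor_side.sendall(b"STATUS\n")
        # Hypothetical internal state the application chooses to expose.
        handle_request(app_side, {"connections": 7, "healthy": True})
        return monitor_side.recv(1024).decode("ascii")
    finally:
        app_side.close()
        monitor_side.close()
```

In production the application would listen on a Unix domain or TCP socket and call something like `handle_request` for each connection it accepts.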


They shouldn’t all be used...


What is important is that you use SOMETHING


What is best?

To crush your enemies, see them driven before you, and to hear the lamentation of their women?


What is best?

• An application that is easily and openly monitored

• A developer that considers monitoring in all phases of design and development

• A developer who writes their own monitoring checks


Do us all a favor...

When you develop software, be it scripts, utilities, programs, or suites, please please please...


Consider how we Ops folks will manage and monitor it.


Thank you for your time.

Matt Simmons

standaloneSA on Twitter

[email protected]

http://www.standalone-sysadmin.com




