Patrol/Ranger Update

Chuck Boeheim, Assistant Director, SLAC Computer Services

Post on 30-Dec-2015

TRANSCRIPT

Page 1: Patrol/Ranger Update

[Slide background graphic: Sun server diagram with labels Tapemove1, Tapemove2, Shire, Objyserv1, A3500 .5TB]

Patrol/Ranger Update

Chuck Boeheim

Assistant Director

SLAC Computer Services

Page 2: Patrol/Ranger Update

History

Patrol originated in 1994
• Originally only to renice processes
• Extended to monitor filesystems, daemons, and to perform more notifications/repairs

Downloaded by over 300 sites; in production use at about 20 known sites

Page 3: Patrol/Ranger Update

Limitations

Original rules language was simple and columnar:

PC afs[0-9]* 50 log,mail(unix-admin)

Difficult to extend to express complexities
• E.g., renice processes using more than 20% of the CPU if the load average is over 3.

Written in Perl 4, so limited by the lack of complex data structures

Page 4: Patrol/Ranger Update

The Rewrite

• Update to Perl5
• Introduce new rules language
• Introduce extensible data collectors
• Rename to System Ranger

Page 5: Patrol/Ranger Update

Rules file structure

Config section supplies local customizations

Ruleset sections define data collectors and the set of rules to be applied to them

Message section defines message texts

Page 6: Patrol/Ranger Update

Config section

Supplies the common customizations made at other sites

config
{
    optsfile(/etc/tailor.opts)
    path(/usr/ucb:/bin:/usr/bin)
    mailfrom('The System Ranger <root>')
    mailreply('Unix Admins <unix-admin>')
}

Page 7: Patrol/Ranger Update

Rulesets

Rulesets name a set of rules and associate them with a data collector:

Ruleset(anyname) collector(process)
{
    list of rules...
}

Builtin data collectors are: System, Process, Daemon, User, Filesystem, File, Service

Custom collectors are planned

Page 8: Patrol/Ranger Update

Rules

A rule is a set of function calls in braces:

Rule { cpu(gt,50) kill() log() }

Functions return SUCCESS or FAILURE

FAILURE causes the remainder of the rule not to be executed; execution passes to the next rule

A rule that succeeds ends processing of the ruleset unless the CONTINUE function appears in it.
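The evaluation order described on this slide can be sketched in Python. This is an illustrative model only, not Ranger's Perl implementation: the names `Rule`, `run_ruleset`, `cpu_gt`, and `action` are hypothetical, and CONTINUE is modeled as a `keep_going` flag on the rule rather than as a function inside it.

```python
from typing import Callable, Dict, List

# A rule function takes the item under test (here a process record)
# and returns True for SUCCESS or False for FAILURE.
Function = Callable[[Dict], bool]


class Rule:
    """A sequence of functions; the first FAILURE stops the rule."""

    def __init__(self, *functions: Function, keep_going: bool = False):
        self.functions = functions
        # Assumption: stands in for the CONTINUE function from the slide.
        self.keep_going = keep_going

    def run(self, item: Dict) -> bool:
        for fn in self.functions:
            if not fn(item):
                return False  # remainder of the rule is skipped
        return True


def run_ruleset(rules: List[Rule], item: Dict) -> List[int]:
    """Apply rules in order; return the indexes of rules that succeeded."""
    fired = []
    for index, rule in enumerate(rules):
        if rule.run(item):
            fired.append(index)
            if not rule.keep_going:
                break  # a successful rule ends the ruleset
    return fired


def cpu_gt(threshold: float) -> Function:
    """Hypothetical stand-in for cpu(gt,N)."""
    return lambda proc: proc["cpu"] > threshold


def action(name: str, journal: List) -> Function:
    """Hypothetical action that only records what it would have done."""
    def fn(proc: Dict) -> bool:
        journal.append((name, proc["pid"]))
        return True
    return fn
```

With two rules, `Rule(cpu_gt(50), kill, log)` followed by `Rule(cpu_gt(25), log)`, a process at 80% CPU stops at the first rule, while a process at 30% falls through to the second.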

Page 9: Patrol/Ranger Update

Rules

The word OR may connect functions:

Rule { cpu(gt,50) or size(gt,20M) kill() }

A sequence of functions in braces returns SUCCESS or FAILURE for the entire sequence:

Rule {{cpu(gt,50) kill()} or cpu(gt,25) log }

A sequence of functions in brackets always returns SUCCESS:

Rule { cpu(gt,50) [size(gt,10M) kill] log }
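The three grouping forms can be modeled as small combinators. The Python sketch below is an assumption-laden illustration, not Ranger's parser: `seq`, `either`, and `optional` are hypothetical names for braces, OR, and brackets, and the grouping of OR with its neighbors (the OR'd tests form one step, followed by the remaining functions) is a guess, since the slides do not state precedence.

```python
from typing import Callable, Dict, List

Function = Callable[[Dict], bool]


def seq(*functions: Function) -> Function:
    """Braces: run functions in order, stop at the first FAILURE."""
    def run(item: Dict) -> bool:
        # all() over a generator short-circuits, so later functions
        # are not called once one has failed.
        return all(fn(item) for fn in functions)
    return run


def either(*alternatives: Function) -> Function:
    """OR: succeed as soon as one alternative succeeds."""
    def run(item: Dict) -> bool:
        return any(fn(item) for fn in alternatives)
    return run


def optional(*functions: Function) -> Function:
    """Brackets: run the sequence, but always report SUCCESS."""
    inner = seq(*functions)

    def run(item: Dict) -> bool:
        inner(item)
        return True
    return run


def cpu_gt(threshold: float) -> Function:
    return lambda proc: proc["cpu"] > threshold


def size_gt(threshold_mb: float) -> Function:
    return lambda proc: proc["size_mb"] > threshold_mb


def action(name: str, journal: List[str]) -> Function:
    """Hypothetical action that only records its name."""
    def fn(proc: Dict) -> bool:
        journal.append(name)
        return True
    return fn


journal: List[str] = []
kill = action("kill", journal)
log = action("log", journal)

# Rule { cpu(gt,50) [size(gt,10M) kill] log }
bracket_rule = seq(cpu_gt(50), optional(size_gt(10), kill), log)

# Rule {{cpu(gt,50) kill()} or cpu(gt,25) log }  (grouping assumed)
or_rule = seq(either(seq(cpu_gt(50), kill), cpu_gt(25)), log)
```

In `bracket_rule`, a process over 50% CPU is always logged, and is killed only if it is also over the size threshold; the bracketed test failing does not stop the rule.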

Page 10: Patrol/Ranger Update

Selection Functions

Apply to specific machines:
• host
• option
• arch
• test

Apply to specific instances:
• user
• group
• name

All tests may be negative or positive, e.g., host(icarus) or user(!root)
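A selection test with optional negation might look like the Python sketch below. It assumes that arguments are regular expressions (the sample rules use patterns such as afs[0-9]+), that a pattern must match the whole value, and that a leading `!` inverts the result; none of this is confirmed by the slides, and `selects` is a hypothetical name.

```python
import re


def selects(pattern: str, value: str) -> bool:
    """Return True if value passes the selection test.

    Assumptions: the pattern is a regular expression anchored to the
    whole value, and a leading '!' negates the test.
    """
    negate = pattern.startswith("!")
    if negate:
        pattern = pattern[1:]
    matched = re.fullmatch(pattern, value) is not None
    return matched != negate


# host(icarus) on host "icarus" passes; user(!root) rejects root.
print(selects("icarus", "icarus"))
print(selects("!root", "root"))
print(selects("!root", "alice"))
```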

Page 11: Patrol/Ranger Update

Comparison Functions

Determine when thresholds are crossed
• cpu - percent of CPU
• size - memory or file size or rate of change
• time - total CPU time

Or test global values
• loadavg, numusers, numprocs, uptime

Have optional first argument specifying comparison: gt, lt, eq, etc.
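A comparison function of this shape can be sketched in Python as a lookup from the comparison name to an operator. The sketch covers only gt, lt, eq, and ne (ne appears in the sample daemon rules). The suffix handling is an assumption: `M` is taken as 1024*1024 bytes and `h` as 3600 seconds, based on thresholds such as 20M and 6h in the sample rules. What happens when the comparison argument is omitted is not stated in the slides and is not modeled here. `compare` and `parse_quantity` are hypothetical names.

```python
import operator

# Comparison names seen in the slides, mapped to Python operators.
COMPARISONS = {
    "gt": operator.gt,
    "lt": operator.lt,
    "eq": operator.eq,
    "ne": operator.ne,
}

# Assumed meanings of the suffixes used in the sample rules.
MULTIPLIERS = {
    "M": 1024 * 1024,  # assumed: mebibytes to bytes
    "h": 3600,         # assumed: hours to seconds
}


def parse_quantity(text: str) -> float:
    """Turn '50', '20M', or '6h' into a plain number."""
    suffix = text[-1]
    if suffix in MULTIPLIERS:
        return float(text[:-1]) * MULTIPLIERS[suffix]
    return float(text)


def compare(op: str, observed: float, threshold: str) -> bool:
    """Evaluate e.g. size(gt,20M) against an observed value."""
    return COMPARISONS[op](observed, parse_quantity(threshold))


# cpu(gt,50) for a process currently at 60% CPU
print(compare("gt", 60, "50"))
```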

Page 12: Patrol/Ranger Update

Action Functions

Specify some action to perform
• log
• mail
• page
• kill, signal (by pid or name)
• nice

Page 13: Patrol/Ranger Update

Sample Process Rules

Rule { host(www.*) pct(gt,10) or size(gt,20M)
       mail(PROC_REPORT,www-monitor) mcons(info) log
}

Rule { {time(gt,6h) kill mail(OVERLIM, $user)} or
       {time(gt,4h) mail(WARN2, $user)} or
       {time(gt,2h) mail(WARN1, $user)}
}

Message OVERLIM <<EOF
The CPU limit for $host is 6 hours. Your
process $pid $cmd has been terminated for
exceeding the limit.
EOF

Page 14: Patrol/Ranger Update

Sample Filesystem Rules

Rule { name(/u[0-9]) pct(gt,99,90+1) page(admin) }

Rule { host(afs[0-9]+) name(/vicep.*)
       { host(afs07) name(/vicepg) } or
       { host(afs08) name(/vicepf) } or
       { pct(gt,98) mail(FSFULL, admin) }
}

Message FSFULL <<EOF
File system $name is $pct% full, grew by $delta%.
EOF
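The way a message text picks up `$name`, `$pct`, and `$delta` can be illustrated with Python's standard `string.Template`, which uses the same `$variable` notation. This only demonstrates the substitution idea; it says nothing about how Ranger itself expands messages, and the `MESSAGES` table and `render` helper are hypothetical.

```python
from string import Template

# Message texts keyed by the identifier used in mail(...) or page(...).
MESSAGES = {
    "FSFULL": Template(
        "File system $name is $pct% full, grew by $delta%."
    ),
}


def render(message_id: str, **values) -> str:
    """Fill in the $variables of a stored message text."""
    return MESSAGES[message_id].substitute(values)


print(render("FSFULL", name="/u1", pct=99, delta=2))
# prints: File system /u1 is 99% full, grew by 2%.
```

`substitute` raises `KeyError` if a variable named in the text is not supplied, which catches a message that refers to a value the collector does not provide.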

Page 15: Patrol/Ranger Update

Sample File Rules

Rule { name(/var/adm*) size(gt,1M) page(admin) }

Rule { name(/etc/passwd) md5()
       mail(PSWDCHG, admin)
}

Message PSWDCHG <<EOF
File $name has been changed!
EOF
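The slides do not say how md5() decides that a file has changed. One plausible reading, sketched below in Python, is that the checksum from the previous pass is remembered and the test succeeds when the new checksum differs. The `ChangeDetector` class is hypothetical, and treating the first sighting of a file as "unchanged" is an assumption.

```python
import hashlib
from typing import Dict


def md5_of(path: str) -> str:
    """Return the hex MD5 digest of a file's contents."""
    with open(path, "rb") as handle:
        return hashlib.md5(handle.read()).hexdigest()


class ChangeDetector:
    """Remember the last digest per file and report differences."""

    def __init__(self) -> None:
        self.known: Dict[str, str] = {}

    def changed(self, path: str) -> bool:
        digest = md5_of(path)
        previous = self.known.get(path)
        self.known[path] = digest
        # Assumption: a file seen for the first time is not "changed".
        return previous is not None and previous != digest
```

A monitor would call `changed("/etc/passwd")` on each pass and send the PSWDCHG message whenever it returns True.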

Page 16: Patrol/Ranger Update

Sample Daemon Rules

Rule { name(nfsd) number(ne,8) page(admin) }

Rule { name(pud) number(lt,1) restart(pud) }

Rule { name(amd) number(gt,1) page(admin) }

Page 17: Patrol/Ranger Update

Sample User Rules

Still somewhat experimental

Rule { user(!root) number(gt,3) pct(gt,50)
       mail(CPUHOG, admin)
}

Message CPUHOG <<EOF
User $user has $number processes using $pct%
of the CPU on $host.
EOF

Page 18: Patrol/Ranger Update

Why Ranger?

• Some automatic monitoring is needed
• Commercial packages are complex and expensive
• Ranger does a lot in a small package
• Because it’s cool

Page 19: Patrol/Ranger Update

Availability

Needs a bit more shakedown at SLAC before distribution

Look for it at http://www.slac.stanford.edu/~boeheim

Will be starting a mailing list; send email to be included