Improved Scripting of IDS Alarms and Events

Thomas Horner, Senior DBA/S1 Corporation
Informix User Forum 2005: Moving Forward With Informix
Atlanta, Georgia, December 8-9, 2005


Page 1: Improved Scripting of IDS Alarms and Events

Improved Scripting of IDS Alarms and Events

Thomas Horner
Senior DBA/S1 Corporation

Informix User Forum 2005 Moving Forward With Informix

Atlanta, Georgia December 8-9, 2005

Page 2: Improved Scripting of IDS Alarms and Events

Overall Objectives

• Enhancements to the supplied scripts
• Help prevent unnecessary late-night pages or cell phone calls
• Be proactive in monitoring of dbspaces
• Same shells can be used for 7.x, 9.x, and 10.x IDS engines

Page 3: Improved Scripting of IDS Alarms and Events

Presentation Overview

• What does IBM/Informix supply?
• Purpose of these custom shells
• Overall design of the shells
• Details of the alarm shell
• Changes made to evidence shell
• Details of the “LookatSpace” shell
• Other shells I use for administration
• Limitations of these shells

Page 4: Improved Scripting of IDS Alarms and Events

IBM Supplied Scripts

• alarmprogram.sh, log_full.sh, no_log.sh, and evidence.sh supplied by IBM/Informix
• IDS 9.4+ and 10.x alarm program is improved over the older versions
  – it gathers additional data for certain alarms
  – it sends email to and/or pages DBA
  – it recognizes the automatic log alarms
• First two functions are in my alarm shell, but not the last one

Page 5: Improved Scripting of IDS Alarms and Events

IBM onconfig Parameters

• ALARMPROGRAM onconfig parameter
  – set to appropriate value (full path name)
• ALRM_ALL_EVENTS onconfig parameter
  – set to 1
• SYSALARMPROGRAM onconfig parameter
  – set to appropriate value (full path name)
• DYNAMIC_LOGS onconfig parameter
  – this needs to be 1 or 0 for my alarm shell
  – all available space in log dbspace allocated up front
  – this is a design decision
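As a quick sanity check on the parameters above, a shell could read them back from the onconfig file. The sketch below is illustrative only: the helper names `get_onconfig_param` and `check_alarm_params` are hypothetical, and it assumes the usual onconfig layout of a parameter name followed by whitespace and its value, with `#` starting a comment line.

```shell
#!/bin/sh
# Hypothetical helper: print the value of one onconfig parameter.
# Assumes "NAME value" lines; comment lines start with '#'.
get_onconfig_param() {
    # $1 = path to onconfig file, $2 = parameter name
    awk -v name="$2" '$1 == name { print $2; exit }' "$1"
}

# Report the alarm-related parameters and flag any that are unset.
check_alarm_params() {
    # $1 = path to onconfig file
    for p in ALARMPROGRAM ALRM_ALL_EVENTS SYSALARMPROGRAM DYNAMIC_LOGS; do
        v=$(get_onconfig_param "$1" "$p")
        if [ -z "$v" ]; then
            echo "$p is not set"
        else
            echo "$p=$v"
        fi
    done
}
```

A typical call would be `check_alarm_params "$INFORMIXDIR/etc/$ONCONFIG"`.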

Page 6: Improved Scripting of IDS Alarms and Events

Purpose of these Shells

• Alarm Shell
  – combines functions of the “default” programs and adds features
• Evidence Shell
  – matches the design of this program to the alarm program changes
• LookatSpace Shell
  – gives DBA an “advance” notice of possible space issues

Page 7: Improved Scripting of IDS Alarms and Events

Purpose of these Shells

• Other Shells used to monitor and administer the databases:
  – check database shell: quick check of engine status
  – onchecks shell: perform oncheck commands weekly
  – update statistics shell: perform scheduled update statistics
  – prune log shell: prune online log and other logs

Page 8: Improved Scripting of IDS Alarms and Events

Overall Design of Shells

• Alarm and Evidence Shells
  – add functionality to supplied default programs
  – do not change how the shells are used by the Informix engine
• LookatSpace Shell
  – run on a scheduled basis to check for low space that may not be obvious from simple onstat -d output
• Other Shells
  – run on a daily or weekly schedule to perform other administrative functions

Page 9: Improved Scripting of IDS Alarms and Events

Overall Design of Shells

• All Shells
  – can be used for multi-instance installations and multiple production databases in one instance
  – can be used across 7.x, 9.x, and 10.x engines

Page 10: Improved Scripting of IDS Alarms and Events

Installation

• These are currently installed on four production servers and several test servers on the following versions:
  – IDS Version 7.24 on HPUX 10.20
  – IDS Version 9.21 on HPUX 11.00
• Other installations are successfully using them (based on emails I have received)
• Requires a means of notifying the DBA team and the Data Center

Page 11: Improved Scripting of IDS Alarms and Events

Alarm Program – Overview

• Five parameters passed from instance:
  – Severity (severity)
    • ranges from 1 through 5
  – Class_ID (class_id)
    • contains the message ID that caused the alarm
  – Message (class_msg)
    • contains the actual text of the alarm
  – Additional Text (specific_msg)
  – Event File (see_also)
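The five parameters arrive as positional arguments. A minimal sketch of capturing and logging them follows, roughly in the style of the alarm.log excerpt shown later in the deck. The function name `log_alarm` and the `ALARM_LOG` variable (with its placeholder default path) are assumptions made for this example, not names taken from the actual shell.

```shell
#!/bin/sh
# Sketch: record the five alarm parameters in a log file.
# ALARM_LOG is a hypothetical variable; the default path is a placeholder.
log_alarm() {
    severity=$1 class_id=$2 class_msg=$3 specific_msg=$4 see_also=$5
    {
        date
        echo "alarm.sh got event $class_id"
        echo "  severity : $severity"
        echo "  message : $class_msg"
        echo "  additional text: $specific_msg"
        echo "  reference file : $see_also"
    } >> "${ALARM_LOG:-/tmp/alarm.log}"
}
```

Inside an alarm program this would be called as `log_alarm "$1" "$2" "$3" "$4" "$5"`; quoting each argument keeps multi-word messages intact.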

Page 12: Improved Scripting of IDS Alarms and Events

Alarm Program – Functions Added

• Set the proper level of notification based on alarm severity
• Prevent overload of machine resources and email caused by duplicate or multiple alarms for the same issue
• Reduce “false” alarms by using mutex files
• Perform logical log backups using ontape
• Option for “no notification”
• Alarm log file used to record alarms and actions

Page 13: Improved Scripting of IDS Alarms and Events

Alarm Changes – Proper Notification Level

• Severity 1 or 2
  – no notification as recommended by IBM/Informix
• Severity 3
  – not critical: email is sent to the DBA team
  – no email if class 6, 15, 21, or 23 (more on why later)
• Severity 4 or 5
  – critical: data center is notified for action and an email is sent to the DBA team for our records
  – no notification if class 6, 15, or 21 (more on why later)
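The notification rules above amount to a small decision table. The sketch below is one way to express it, not the actual alarm shell: the function name `notify_action` and the labels `none`, `email`, `critical`, and `deferred` are invented for this example. `deferred` marks the classes that the general path skips because they have their own sections of code.

```shell
#!/bin/sh
# Sketch: map (severity, class) to a notification action.
# Covers notification only; the class 23 log backup itself is separate.
notify_action() {
    severity=$1 class_id=$2
    # Severity 1 or 2: never notify.
    if [ "$severity" -le 2 ]; then
        echo none; return
    fi
    # Classes 6, 15, 21 (any severity of 3 or more) and class 23
    # (severity 3 only) are left to their dedicated code sections.
    case "$class_id" in
        6|15|21) echo deferred; return ;;
        23) if [ "$severity" -eq 3 ]; then echo deferred; return; fi ;;
    esac
    # Everything else: severity 3 emails the DBA team, 4 or 5 is critical.
    if [ "$severity" -eq 3 ]; then echo email; else echo critical; fi
}
```

For example, `notify_action 4 23` yields `critical`, because class 23 is excluded only at severity 3.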

Page 14: Improved Scripting of IDS Alarms and Events

Stop Duplicate Alarms

• Biggest design change I made from the default alarm programs
• Classes 6, 15, and 21 can cause multiple alarms
  – class 6 is “non fatal” Internal Subsystem Failure
  – class 15 is Data Replication Failure
  – class 21 is Online Resource Overflow
• Idea for this change came with my first encounter with multiple class 21 alarms
  – caused by process exceeding available number of locks (version 7.x engine)
  – hundreds of emails received within a minute. OOPS!

Page 15: Improved Scripting of IDS Alarms and Events

Stop Duplicate Alarms (cont’d)

• Separate section of code to handle classes 6, 15, and 21
• Class 23 (logical log backup needed) also has specific section of code to perform log backups
• Shell uses distinctly named files in /tmp for these three classes of alarms:
  – /tmp/event${ENV}${FILENO}.`date +%H`
• Alarm is considered new if this file in /tmp does not exist or if that file is more than one hour old
• One hour threshold was a design decision

Page 16: Improved Scripting of IDS Alarms and Events

Stop Duplicate Alarms (cont’d)

• Steps used to handle classes 6, 15, and 21:
  – if the alarm severity is less than 3, ignore the alarm
  – if the file in /tmp exists and is less than one hour old:
    • consider this a duplicate alarm of this class
    • simply log it
  – if the file in /tmp does not exist, or the file is more than one hour old, this is the first alarm of this class:
    • follow notification protocol
    • create (or update) the /tmp file for this alarm
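The steps above can be sketched as a single function. This is an illustration under assumptions rather than the original code: `check_duplicate` and `MARKER_DIR` are hypothetical names, `FILENO` from the file name pattern is assumed to be the class number, and the age test uses `find -mmin`, which is widely available but not required by POSIX (older platforms may need a different age check).

```shell
#!/bin/sh
# Sketch of duplicate suppression for classes 6, 15 and 21.
# Prints one of: ignore, duplicate, new (labels invented for this example).
check_duplicate() {
    severity=$1 class_id=$2
    marker="${MARKER_DIR:-/tmp}/event${ENV}${class_id}.$(date +%H)"

    # Below severity 3 the alarm is ignored outright.
    if [ "$severity" -lt 3 ]; then
        echo ignore; return
    fi
    # find prints the path only if the marker is more than 60 minutes old,
    # so empty output means a recent marker, that is, a duplicate alarm.
    if [ -f "$marker" ] && [ -z "$(find "$marker" -mmin +60)" ]; then
        echo duplicate; return
    fi
    # First alarm of this class: create or refresh the marker file.
    touch "$marker"
    echo new
}
```

The caller would log a `duplicate` result and follow the notification protocol only for `new`.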

Page 17: Improved Scripting of IDS Alarms and Events

Alarm – Real alarm.log output

Fri Jul 19 09:40:24 EDT 2002
alarm.sh got event 21
  severity : 3
  message : OnLine resource overflow: 'Locks'.
  additional text: Lock table overflow - user id 106, session id 1133666
  reference file :

Fri Jul 19 09:40:30 EDT 2002
alarm.sh got event 23
  severity : 2
  message : Logical Log 15362 Complete.
  additional text: Logical Log 15362 Complete.
  reference file :

Fri Jul 19 09:40:39 EDT 2002
alarm.sh got event 18
  severity : 2
  message : Log Backup completed: 15362.
  additional text: Logical Log 15362 - Backup Completed
  reference file :

Page 18: Improved Scripting of IDS Alarms and Events

Alarm – Real alarm.log output (cont’d)

Fri Jul 19 09:40:39 EDT 2002
Multiple alarms - class 21, severity 3.
Fri Jul 19 09:40:40 EDT 2002
Multiple alarms - class 21, severity 3.
Fri Jul 19 09:41:02 EDT 2002
Existing class 21 issue - no notification needed.
Fri Jul 19 09:41:03 EDT 2002
Multiple alarms - class 21, severity 3.
Fri Jul 19 09:41:03 EDT 2002
Multiple alarms - class 21, severity 3.
Fri Jul 19 09:41:05 EDT 2002
Multiple alarms - class 21, severity 3.
Fri Jul 19 09:41:05 EDT 2002
Existing class 21 issue - no notification needed.
Fri Jul 19 09:41:17 EDT 2002
alarm.sh got event 23
  severity : 2
  message : Logical Log 15363 Complete.
  additional text: Logical Log 15363 Complete.
  reference file :

Page 19: Improved Scripting of IDS Alarms and Events

Alarm – Perform Logical Log Backups

• Make sure no other log backup is running:
  – check for /tmp/ontape.L${ENV}, a mutex file
  – if it does exist, do not start another log backup, and notify the DBA team via email
  – not considered critical because this can occur normally when logs turn over quickly
  – create the /tmp/ontape.L${ENV} mutex file if it does not exist and continue
• If the onconfig file has /dev/null for the LTAPEDEV parameter, run ontape -a to free the log, then exit

Page 20: Improved Scripting of IDS Alarms and Events

Alarm – Perform Logical Log Backups (cont’d)

• Make sure engine is up using “onstat -” command
  – if not, follow notification protocol (severity is critical)
• Make sure log backup device is ready
  – if not, follow notification protocol (severity is critical)
• Determine number of first and last log that will be in this backup file using “onstat -l” command piped to a grep

Page 21: Improved Scripting of IDS Alarms and Events


Alarm – Perform Logical Log Backups (cont’d)

• Note any “missing” log numbers in log file

• Perform the actual log backup using “ontape -a”

• If ontape command fails, follow notification protocol (severity is critical)

• Move, rename, and compress the log backup file using gzip

• Remove the mutex file so that the next log backup can run
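The mutex handling that brackets the log backup can be sketched as a small wrapper. This is an assumption-laden illustration: `run_with_mutex` is a hypothetical name, the busy return code 99 is arbitrary, and the wrapper creates the file with the shell's noclobber option so that the existence check and the creation happen in one step, which is a slightly stricter variant of "check, then create" as described on the slides.

```shell
#!/bin/sh
# Sketch: run a command under a mutex file.
# Returns 99 if the mutex already exists, else the command's exit status.
run_with_mutex() {
    mutex=$1; shift
    # With noclobber (set -C) the redirect fails if the file exists,
    # which means another log backup is still running.
    if ! ( set -C; : > "$mutex" ) 2>/dev/null; then
        echo "already running: $mutex exists"
        return 99
    fi
    "$@"
    status=$?
    # Remove the mutex so that the next log backup can run.
    rm -f "$mutex"
    return $status
}
```

A call such as `run_with_mutex "/tmp/ontape.L${ENV}" ontape -a` would then cover the first and last steps of the sequence. A return of 99 would lead to the non-critical email to the DBA team, and any other non-zero status to the critical notification protocol. A real script would also need to supply answers to any prompts ontape issues.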

Page 22: Improved Scripting of IDS Alarms and Events

Alarm – No Notification Option

• At beginning of alarm program, it looks for file named alarm.nomail in /usr/informix
• MAILFLAG shell variable is set to “off” if the file exists and to “on” otherwise
• Before every statement where notification is to be sent, the MAILFLAG variable is checked
• If MAILFLAG is “off”, do not send email or notify Data Center
• If MAILFLAG is “on”, send email and (if critical) notify Data Center
• You can simply remove the alarm.nomail file to start having notifications sent
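A minimal sketch of this switch follows. The names `set_mailflag`, `notify`, `NOMAIL_DIR`, and `MAIL_CMD` are hypothetical hooks added for illustration. The default directory follows the slide, and the default mail command assumes a mailx-style `-s subject recipient` interface.

```shell
#!/bin/sh
# Sketch of the "no notification" switch.
set_mailflag() {
    # The presence of alarm.nomail turns notification off.
    if [ -f "${NOMAIL_DIR:-/usr/informix}/alarm.nomail" ]; then
        MAILFLAG=off
    else
        MAILFLAG=on
    fi
}

# Send a message only when MAILFLAG is "on".
# $1 = subject, $2 = recipient, message body on stdin.
notify() {
    if [ "$MAILFLAG" = "on" ]; then
        ${MAIL_CMD:-mailx} -s "$1" "$2"
    else
        cat > /dev/null   # discard the body, send nothing
    fi
}
```

Routing every notification through one function keeps the MAILFLAG test in a single place instead of repeating it before each mail statement.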

Page 23: Improved Scripting of IDS Alarms and Events

Evidence Program – Overview

• Default (supplied) program is called evidence.sh
• Normally called by engine when an assert failure occurs to “gather evidence” for use by IBM/Informix support
• Not supplied with 7.2x engines
• Set through the SYSALARMPROGRAM configuration parameter
• Twelve parameters are passed to program
• IBM/Informix recommends not changing the functions of this more complex shell

Page 24: Improved Scripting of IDS Alarms and Events

Evidence Program – Issues Addressed

• I did change the notification techniques to match those used in the alarm program
• Added the use of MAILFLAG to stop notification
• Added notification for warnings (email to DBA team) in addition to failures
• Put in appropriate values for the environment variables at the beginning of the program
• I do not email the assert failure file (which the default program does) because of its large size
• Named the program evidence.${ENV} for use in multiple instances

Page 25: Improved Scripting of IDS Alarms and Events

LookatSpace Program – Purpose

• You may think that you have plenty of free space in a particular dbspace
  – one table that requests a large next extent can use up all the remaining free dbspace
  – another table in the same dbspace that also needs additional space can be “out of luck” and a SQL error will be returned to the user
• This shell looks for this type of situation and emails any issues found to the DBA team
• DBA team then has time to add a chunk to the dbspace before it becomes critical
• We run this once a week on a scheduled basis

Page 26: Improved Scripting of IDS Alarms and Events

LookatSpace – Program Design

• Use sysmaster SQL to find the database that holds the largest table in the instance; this is taken to be the production database (assumes only one)
• Obtain dbspace usage using sysmaster SQL
  – separate out those that contain blobs for use later
• Obtain which non-fragmented tables are in what dbspace using SQL
• Obtain which fragmented tables are in what dbspace using SQL

Page 27: Improved Scripting of IDS Alarms and Events

LookatSpace – Program Design (cont’d)

• Two lists of dbspaces are created
  – we do not put non-fragmented and fragmented tables in the same dbspace
• If dbspace contains no tables or blobs, and has less than 3% free space:
  – assume that this dbspace contains only indexes
  – send email to DBA team because it is low on space
• If dbspace has non-fragmented tables:
  – obtain table space usage and future needs
  – uses sysmaster SQL

Page 28: Improved Scripting of IDS Alarms and Events

LookatSpace – Program Design (cont’d)

• If dbspace has fragmented tables:
  – obtain table space usage and future needs
  – uses sysmaster SQL
• If space is more than 80% used, and next extent is greater than free space remaining in the dbspace:
  – send an email to the DBA team
• If space is more than 95% used, and next extent is greater than available dbspace:
  – add a warning message to that DBA team email
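The two thresholds can be illustrated with a short sketch. It assumes the page counts have already been fetched from sysmaster; the function names `pct_used` and `check_table` and the labels `ok`, `email`, and `warning` are invented for this example. The arithmetic is delegated to awk because percentages such as 84.56 are not integers.

```shell
#!/bin/sh
# Percent of a table's allocated pages that are in use.
pct_used() {
    # $1 = pages allocated, $2 = pages free
    awk -v a="$1" -v f="$2" 'BEGIN { printf "%.2f\n", (a - f) * 100 / a }'
}

# Decide what to report for one table.
check_table() {
    # $1 = percent used, $2 = next extent (pages), $3 = free pages in dbspace
    awk -v u="$1" -v n="$2" -v f="$3" 'BEGIN {
        if (n > f && u > 95)      print "warning"   # email plus warning text
        else if (n > f && u > 80) print "email"     # email to the DBA team
        else                      print "ok"        # report only, no email
    }'
}
```

With the figures from the sample email later in the deck, `pct_used 1499947 231611` gives `84.56`, and `check_table 84.56 250000 99997` gives `email`, since the 250000-page next extent exceeds the 99997 free pages.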

Page 29: Improved Scripting of IDS Alarms and Events


LookatSpace – Program Design (cont’d)

• If dbspace contains blobs, check free space in dbspace and the number of blobs remaining

• If space available is less than 3% and number of blobs remaining is less than 20000, send an email with warning to the DBA team

• While the program goes through all these steps, a basic text report (space report) is created

• If there are no issues to report, no email is sent, but the space report is always available for review
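The blob check and the report-versus-email split described above can be sketched as follows. This is illustrative only: the names `check_blob_dbspace`, `run_checks`, `REPORT`, and `ISSUES` are hypothetical, and the input is assumed to be one line per dbspace with its name, percent free, and number of blobs remaining already computed.

```shell
#!/bin/sh
# Print a warning when a blob dbspace is under both thresholds.
check_blob_dbspace() {
    # $1 = dbspace name, $2 = percent free, $3 = blobs remaining
    awk -v d="$1" -v p="$2" -v b="$3" 'BEGIN {
        if (p < 3 && b < 20000)
            printf "WARNING: %s has %s%% free, room for %d more blobs\n", d, p, b
    }'
}

# Read "name pct_free blobs_remaining" lines on stdin.
# Every line goes to the report; only warnings go to the issues file.
run_checks() {
    while read -r name pct blobs; do
        echo "$(date) $name free=$pct% blobs_remaining=$blobs" >> "$REPORT"
        check_blob_dbspace "$name" "$pct" "$blobs" >> "$ISSUES"
    done
    [ -s "$ISSUES" ]   # succeeds only when there is something to email
}
```

The exit status of `run_checks` tells the caller whether to mail the issues file, while the report grows on every run regardless.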

Page 30: Improved Scripting of IDS Alarms and Events

LookatSpace – Program Design (cont’d)

• The report is appended to each week, so a history of space utilization is available for analysis
• A future enhancement could include looking at the index dbspaces
  – we have had these unexpectedly fill up when there is more than one large index in the same dbspace
• Another enhancement can be to write code to analyze the space utilization reports and obtain trending information

Page 31: Improved Scripting of IDS Alarms and Events


LookatSpace – Sample Email

Space is low in DBSpace dbs1 with tables on Tue Sep 27 05:31:00 EDT 2005 for host sf8pdb1, instance sfarm_shm.

Table vfmtrnaudactvty next extent of 250000 pages will use all free 99997 pages in dbs1.

Table has 1499947 pages allocated, 231611 pages free, and 84.56 percent used.

Details are located in the /usr/informix/logs/checkspc.out file.

Page 32: Improved Scripting of IDS Alarms and Events

Other Shells I Use

• Check Database Shell
  – checks to see if engine is up and active on a scheduled basis
  – performs log move if requested (uses onmode commands)
  – log move is run from another shell (to prevent issue in case of hung checkpoint)
  – log move option is used in our shop for disaster recovery purposes
• Onchecks Shell
  – performs basic oncheck commands on a weekly basis

Page 33: Improved Scripting of IDS Alarms and Events

Other Shells I Use (cont’d)

• Update Statistics Shell
  – can choose how update statistics is run via input parameters
  – temporarily changes certain Informix environment variables to improve performance while running update statistics
• Prune Log Shell
  – archives various log files monthly
  – also archives the online.log
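One way a monthly prune could work is sketched below. The slides do not describe the actual method, so this is an assumption: the function name `prune_log`, the year-month suffix, and the copy-then-truncate approach are all choices made for this example. Truncating in place lets a process that holds the log open keep writing to the same file, at the cost of possibly losing lines written between the copy and the truncate.

```shell
#!/bin/sh
# Sketch: archive a log file and empty it in place.
prune_log() {
    # $1 = log file to prune, $2 = archive directory
    archive="$2/$(basename "$1").$(date +%Y%m)"
    # Copy first; stop if the copy fails so no data is thrown away.
    cp "$1" "$archive" || return 1
    # Truncate the live log without removing or renaming it.
    : > "$1"
    # Compress the archived copy.
    gzip -f "$archive"
}
```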

Page 34: Improved Scripting of IDS Alarms and Events


Limitations of these Shells

• The shells (except alarm or evidence) are run on a scheduled basis, not on a demand basis

• The LookatSpace shell requires that fragmented and non-fragmented tables not be in the same dbspace

• The LookatSpace shell does not “predict” when index dbspaces will fill up

• Certain thresholds are “hard-coded” in the shells and may need to be changed for your installation

• Certain names of files and directories are coded in the shells and may need to be changed for your installation

• Latest enhancements of data gathering features of 9.4+ supplied alarm program are not in the alarm shell

Page 35: Improved Scripting of IDS Alarms and Events

Review

• Alarm program
  – took the IBM/Informix “template” and ideas of others and myself to make it more robust
  – handles multiple alarms and performs log backups
• Evidence program
  – took the IBM/Informix “template” and made notification consistent with the alarm program
• LookatSpace program
  – helps the DBA team identify space issues before they impact end user or become an “emergency”
• Other shells we use to monitor the engines

Page 36: Improved Scripting of IDS Alarms and Events


Questions and Comments?

• To get a copy of these shells, email me at [email protected]. I can package the files and send them to you via email.

• The objective here was to prevent the unnecessary page or phone call that may result in fixing something that is not actually broken.

• Proactive monitoring of dbspaces using LookatSpace is better than that 3 am page requiring you to add a chunk.

• Thank you all for your attention. I hope that these shells enable you to keep better informed about the status of your production systems.

Page 37: Improved Scripting of IDS Alarms and Events

Improved Scripting of IDS Alarms and Events

Thomas Horner
[email protected]

Informix User Forum 2005 Moving Forward With Informix

Atlanta, Georgia December 8-9, 2005