zabbixreference-actualizado

ITA Zabbix Reference

17/06/2014

01 Network Related

01.001 Unreachable The host cannot be reached by ICMP ping from our monitoring server. Action: You must carefully check why the host can’t be reached. Contact: POC of the device

01.002 Smokeping Packet Loss detected Smokeping sends 20 ICMP pings to one host. If a certain number do not return, you get this alarm. This means that the host is, generally speaking, UP, but does not have a 100% functional communication line to the monitoring server. Action: Carefully check where the packet loss originates. Try ‘mtr’ from noc1 to the host. Also check the Zabbix packet loss graph to see if this started at a specific point. Contact: Core and/or Infrastructure.
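As a rough illustration of the check described in 01.002, the sketch below computes packet loss from a burst of 20 probes. The 10% alarm threshold and the function names are assumptions for illustration only; the real trigger thresholds are defined in Zabbix.

```python
def packet_loss_percent(sent: int, received: int) -> float:
    """Return the share of probes that did not come back, in percent."""
    if sent <= 0:
        raise ValueError("sent must be positive")
    if not 0 <= received <= sent:
        raise ValueError("received must be between 0 and sent")
    return 100.0 * (sent - received) / sent


def loss_alarm(sent: int, received: int, threshold_percent: float = 10.0) -> bool:
    """True when the host still answers but loses more probes than allowed.

    threshold_percent is a hypothetical value, not the production setting.
    """
    loss = packet_loss_percent(sent, received)
    # 100% loss is an "Unreachable" condition (01.001), not a packet loss alarm.
    return threshold_percent <= loss < 100.0
```

For example, 17 replies out of 20 probes is 15% loss, which this sketch would report as an alarm.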

01.003 Smokeping High Latency Warning (local/national/international..) Smokeping sends 20 ICMP pings to one host and also measures the average round-trip time of each ping. On a local network (within Luanda or similar) we expect a very low latency (e.g. < 30 ms), nationally (like Luanda <> Benguela) it can be up to 55 ms, and internationally it can be several hundred milliseconds. This alarm shows that the actual latency is higher than we expect. Action: Verify where the latency is coming from and also test carefully for packet loss. Contact: Infrastructure / Core
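The latency expectations in 01.003 can be sketched as a small lookup. The local and national limits come from the text; the international limit of 400 ms is an assumed placeholder for "several hundred milliseconds", and all names are hypothetical.

```python
# Expected maximum round-trip times in milliseconds per link scope.
EXPECTED_MAX_RTT_MS = {
    "local": 30.0,           # from the text
    "national": 55.0,        # from the text
    "international": 400.0,  # assumption, the text only says "several hundred"
}


def average_rtt_ms(samples):
    """Average the round-trip times of one probe burst."""
    if not samples:
        raise ValueError("no samples")
    return sum(samples) / len(samples)


def latency_too_high(scope: str, samples) -> bool:
    """True when the average RTT exceeds the expectation for this scope."""
    return average_rtt_ms(samples) > EXPECTED_MAX_RTT_MS[scope]
```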

01.004 Low Throughput on backhaul links The trigger measures the actual bandwidth/throughput of a link (== how much data is currently flowing). If this is lower than expected you will get this error. The expectation depends on the link. Check the severity. Action: Verify if there is any problem. If it’s on a radio link, check the radio parameters and/or other alarms. Contact: Infrastructure and/or Core

01.005 Low Throughput on International Links The trigger measures the actual bandwidth/throughput of a link (== how much data is currently flowing). If this is lower than expected you will get this error. The expectation depends on the link. Check the severity. For a planned outage, however, you should be aware beforehand (check email)! If the outage was not planned, escalate immediately to the core team and to the link provider (e.g. Angola Telecom / Angola Cables, satellite provider). Action: Contact Core + Provider
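To make "how much data is currently flowing" concrete, the sketch below derives throughput from two readings of an interface octet counter and compares it with a per-link expectation. How the Zabbix items are actually calculated is not stated here, so the function names and the expected minimum are assumptions.

```python
def throughput_bps(octets_start: int, octets_end: int, seconds: float) -> float:
    """Bits per second between two readings of an octet counter."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    if octets_end < octets_start:
        raise ValueError("counter wrapped or was reset between readings")
    return (octets_end - octets_start) * 8 / seconds


def low_throughput(measured_bps: float, expected_min_bps: float) -> bool:
    """True when the link carries less traffic than expected for this link."""
    return measured_bps < expected_min_bps
```

With 1,250,000 octets transferred in 10 seconds the link carries 1 Mbit/s, which would count as low against an assumed expectation of 5 Mbit/s.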

02 Routing Device Related (Cisco/Juniper…)

02.001 CPU Temperature high If the temperature of a CPU goes above a certain level the router might get damaged or might shut down automatically. Action: Verify the temperature by logging into the router. Check the room temperature sensors (possibly also the intake temperature of the same device). The AirCon in the room has likely failed. Contact: Infrastructure Department (in case of AC failure), Core for other reasons (like fan errors etc.)

02.002 Fan Errors or Malfunction Fans keep the air inside a router circulating. Keeping them functional is critical to keep a device cool. If fans are malfunctioning you will likely see the CPU temperature going up. Action: Verify the fans by logging into the router Contact: Core

02.003 Intake/Inlet Temperature High This is the air temperature when entering the router for cooling. This should be same or similar as the normal Room Temperature. In an air conditioned environment this should not exceed 27 degrees.

Action: Verify Room Temperature or temperature sensors, Verify CPU temperature of device Contact: Infrastructure

02.004 CPU Load If a router is very busy handling requests the CPU load will increase. If it is too high, the router will start to malfunction (e.g. dropping packets etc.), because its CPU is not strong enough to handle the requests. Normally the CPU load should stay below 60%. We have a serious problem if it is above 90%. Action: Verify Alarm Contact: Core
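The two limits named in 02.004 can be expressed as a simple classification. The limits of 60% and 90% come from the text; the label "elevated" for the range in between and the function name are assumptions.

```python
def cpu_load_status(load_percent: float) -> str:
    """Classify a router CPU load reading in percent."""
    if not 0 <= load_percent <= 100:
        raise ValueError("load_percent must be between 0 and 100")
    if load_percent > 90:
        return "serious"   # above 90%: serious problem
    if load_percent >= 60:
        return "elevated"  # assumed label for the range between the limits
    return "normal"        # below 60%: expected
```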

02.005 Reboot Detected We detected that a device rebooted. This could have many causes, e.g. someone doing maintenance, power problems, or router hardware or software problems. A reboot must be checked carefully and action must be taken! Action: Verify the reboot (check uptime) Contact: Core and/or Infrastructure

02.006 Outlet Temperature Normally not measured, but this is the outlet temperature of a device. Air flows from the intake through the inside (CPU) to the outlet, so the outlet temperature should be higher than the inlet temperature. Action: See 02.003

02.007 Temperature Check A combined temperature check done by the router. Action: Verify by logging in, check temperature of other devices in same POP Contact: Infrastructure

02.008 Chassis Temperature The temperature inside the chassis of the device. Action: Verify by logging in, check temperature of other devices in same POP Contact: Infrastructure

03 Radio Problems

03.001 Low Signal / Low RSL The received signal level is not as good as expected. This can cause link degradation. Also check the severity, as it might depend on the signal level actually seen. This trigger can be seen on various technologies (Microwave, VSAT …). Many high frequencies are sensitive to rain. If there is currently heavy rain, there is no need to escalate this! Action: Log in to the device, confirm signal degradation. Check for actual packet loss. Contact: Infrastructure or VSAT

03.002 Low Capacity / Bitrate It has been detected that due to the radio conditions a link (most likely a p2p) does not have the expected or required bandwidth to support the network. Action: Check the radio for signal degradations, verify the problem, check for packet loss Contact: Infrastructure

03.003 Errors on Microwave Radio Link (ES/SES/UAS) The error rate is higher than expected. This can be caused by bad weather / rain. Action: Escalate to Infrastructure Department

04 Power

04.001 No network Power Applicable to UPS systems. This indicates that there is currently no input power and the system is running on batteries. Under normal circumstances this means there is no power from the municipality and the generator did not start! Action: Log in to the UPS, verify the alarm, check battery voltage level and uptime. Contact: Infrastructure team. If in the provinces, the province standby team.

04.002 Battery voltage low There is no input power [04.001] and additionally the battery voltage is low, meaning the site might go down very soon. Action: Verify the alarm, then take highest priority action to contact the infrastructure team to go to site and get the generator started. Contact: Infrastructure

04.003 UPS Alarms / Major / Minor / Critical The UPS reports an alarm of the mentioned severity. Action: Log in to the UPS and read the alarm description. If critical, escalate immediately to the infrastructure team

04.100 Genset Battery Voltage The measured battery voltage of a generator is low. This can prevent the generator from starting on power cuts! Action: Contact Infrastructure to check

04.101 Genset Maintenance Due The generator has been running for a long time and must go for maintenance. Action: Email Infrastructure. Make sure they reset the counter after maintenance has been performed. Then this alarm will disappear.

04.102 Genset Engine Running The generator is currently running.

04.103 Genset Engine running for > 48h The generator has been running without stop for more than 48 hours. Action: Escalate to Infrastructure to verify with the property owner if there’s a power problem.

05 Radio Access Network Related

05.1XX Alvarion/Telrad 4Motion Related

05.101 AAA Low Latency Disabled The Base station requires a low latency to the AAA server unless a specific setting is set. Unfortunately the setting does not survive a reboot of the base station. If you see this alarm in the provinces we have a problem.

Action: Enter the BS using Telnet and enter the following commands:

    NPU login: root
    Password:
    npu# conf t
    npu(config)# authenticator eaptransferinterval 5000
    npu(config)# exit
    npu#

If the error does not disappear, escalate to the systems team! URGENT

05.102 AAA Switched A base station authenticates CPEs using RADIUS against an AAA server. We have two of them, one at lda6 and one at lda11 (aaa1.lda6, aaa1.lda11). This alarm means the base station decided to use the alternative AAA server for authentication. Action: Check why this happened and whether other base stations did the same. Watch carefully for any events or complaints on the WiMAX link. Watch the affected BS carefully for registered subscribers. Contact: Core (network issues) or Systems (AAA issues)

05.103 No subscribers on a specific sector Almost all of our base stations have multiple sectors pointing in different directions. This error indicates that there is one sector that does not have any CPEs connected. This error can happen on very empty base stations (for example on weekends or during power cuts), but it must not happen on busy ones! Action: Log into the base station and verify the number of subscribers (sh ms info). Also check on Zabbix (click on the trigger > simple Graph) the number of usually connected CPEs. Log in to alvaristar and check for errors on the BS. Contact: Infrastructure

05.104 Low GPS Satellite Count The base station needs a working GPS to keep the subscriber units synchronized over the network. If the GPS cannot see enough satellites for a long time the system might get out of sync and malfunction. Action: Check weather conditions, as this can be normal in heavy rain. If not, escalate to Infrastructure in business hours. Contact: Infrastructure

05.105 ODU errors The system detected an ODU (Outdoor Unit) error. This likely affects services. Action: Log in to the base station and escalate immediately if confirmed. Contact: Infrastructure

05.106 AU Error The system detected an AU error. This likely affects service. Action: Log in to the base station and escalate immediately if confirmed. Contact: Infrastructure

05.107 MS Counter Difference The number of registered CPEs on a base station suddenly changed. This can be normal in many cases (for example shortly before 8am when businesses open), but if it happens repeatedly and on other base stations too, urgent attention is required. Action: Check the simple graph of registered CPEs and see whether it looks abnormal. Contact: Infrastructure
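One way to think about a "sudden change" in 05.107 is as a relative jump between two consecutive samples of the registered CPE count. The sketch below assumes a 30% limit; both the limit and the function name are hypothetical and do not reflect the actual trigger expression.

```python
def ms_count_jump(previous: int, current: int, max_change_ratio: float = 0.3) -> bool:
    """True when the CPE count changed by more than max_change_ratio.

    max_change_ratio is an assumed value for illustration only.
    """
    if previous < 0 or current < 0:
        raise ValueError("counts must not be negative")
    if previous == 0:
        # No baseline to compare against: treat any registration as a change.
        return current != 0
    return abs(current - previous) / previous > max_change_ratio
```

A drop from 100 to 60 registered CPEs is a 40% change and would be flagged, while a drop from 100 to 90 would not.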

05.108 No MS Registrations There is not a single CPE registered on the entire base station! Except on very new base stations, this normally indicates a problem. Action: Verify the base station for errors, verify the bearer network, verify the AAA Contact: Infrastructure and/or Core

05.120 AAA Processing Requests This monitors the AAA server(s) required for CPE authentication. If an AAA server stops working you should see a lot of [05.102]. If this happens, escalate to the systems team, as it’s highly important that we have both servers operational at all times. Action: Check for possible network issues that could have caused this. Contact: Core (if network issues) or Systems!

05.200 Alvarion/Telrad FDD Micro Base Stations

05.201 ODU not operational The Outdoor Unit reports that it’s not operational. This most likely affects service! Action: Log in and check the ODU status and any log messages Contact: Infrastructure

06 Server

06.001 Low Memory Server reports low memory (RAM). This will likely impact the performance of the server and might even make it unresponsive! Action: Contact Systems department

06.002 High System Load The ‘busyness’ of a UNIX server is indicated by the ‘System Load’ (== the number of processes in the scheduler’s run queue). The expected value differs widely from server to server, depending on the tasks they run. For most servers a value of about 3.0 is typical; m3ms can go up to 30 without performance problems. If this trigger stays up for a while and the severity is HIGH, escalate. Action: If possible, check whether the system is responsive; if in any doubt, escalate according to the severity. Contact: Systems
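The per-server expectation in 06.002 can be sketched as a lookup with a default. The values 3.0 and 30 are taken from the text but are used here as hard limits, which is an assumption; the server name "mail1" in the usage note is hypothetical. `os.getloadavg()` is only available on UNIX-like systems.

```python
import os

# Assumed per-server limits, loosely based on the values named in the text.
LOAD_LIMITS = {"default": 3.0, "m3ms": 30.0}


def load_limit_for(server: str) -> float:
    """Return the load limit for a server, falling back to the default."""
    return LOAD_LIMITS.get(server, LOAD_LIMITS["default"])


def load_too_high(server: str, load_1min: float) -> bool:
    """True when the 1-minute load average exceeds the server's limit."""
    return load_1min > load_limit_for(server)


def check_local_host(server: str) -> bool:
    """Read the 1-minute load average of this machine and compare it."""
    load_1min, _, _ = os.getloadavg()
    return load_too_high(server, load_1min)
```

Under these assumptions, `load_too_high("mail1", 5.0)` is true, while `load_too_high("m3ms", 12.0)` is false.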

06.003 Standard TCP port test Most servers provide network services reachable over TCP. Several tests exist to check whether a TCP port is up and listening. You can verify all ports using TELNET.

    http   TCP/80   Web services
    https  TCP/443  Web services over SSL
    imap   TCP/143  Mailbox access
    imaps  TCP/993  Mailbox access over SSL
    pop3   TCP/110  Mailbox access
    pop3s  TCP/995  Mailbox access over SSL
    smtp   TCP/25   Simple Mail Transfer Protocol
    smtps  TCP/465  Simple Mail Transfer Protocol over SSL
    ssh    TCP/22   Secure Shell (administration)
    (there might be others)

Action: Check if it affects customers Contact: Systems
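As an alternative to checking a port by hand with TELNET, a minimal sketch of a TCP connect test in Python is shown below. It only checks that a connection can be established, not that the service behind it works; the function name and the 3 second timeout are assumptions.

```python
import socket


def tcp_port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False
```

For example, `tcp_port_open("webmail.maxnet.ao", 443)` would test the https port of the webmail host mentioned in 06.007.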

06.004 IPA Replication Down IPA is the system we use for centralized authentication. Any change replicates to various other servers across the Group (ITA/ITZ/ITN…). This test checks whether replication works. Action: Contact Systems!

06.005 Mailserver Mailq This monitors the current length of the queue of mails waiting to be delivered. If this exceeds a certain number, it is most likely caused by malware that abuses the server to relay spam. This cannot really be prevented automatically, because the senders normally use valid authentication from a hacked account. It requires immediate action by the systems team to find the hacked mailbox, block it, and delete the spam mails from the queue. If we do not take action the mailserver will end up on a blacklist and all mail can get blocked. Disaster! Action: Contact Systems!

06.006 Mailserver Test This is a script testing the general functionality of a mailserver. If raised, mailserver most likely is not functioning properly and customers are affected. Action: Contact Systems immediately!

06.007 Webmail Test This is a script that emulates the usage of the webmail (e.g. http://webmail.maxnet.ao). If it fails, the webmail is most likely broken. As customers likely use this, immediate action is required. Action: Contact Systems

06.008 Backups The servers are configured to perform automatic backups to remote locations. A script checks whether the backup was made successfully. If not, this requires attention by the systems team. Action: Contact the systems team during the next office hours. The alarm will disappear once a recent backup is found.
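The actual backup check script is not shown here. As a sketch of the idea of "a recent backup is found", the code below looks at the newest file in a backup directory and compares its age with a limit. The 24 hour limit, the directory layout and the function names are all assumptions.

```python
import os
import time


def newest_mtime(directory: str):
    """Return the modification time of the newest regular file, or None."""
    with os.scandir(directory) as entries:
        mtimes = [entry.stat().st_mtime for entry in entries if entry.is_file()]
    return max(mtimes) if mtimes else None


def backup_is_recent(directory: str, max_age_seconds: float = 24 * 3600, now=None) -> bool:
    """True when the newest file in the directory is younger than the limit.

    max_age_seconds is an assumed value, not the production setting.
    """
    now = time.time() if now is None else now
    newest = newest_mtime(directory)
    return newest is not None and now - newest <= max_age_seconds
```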

06.100 MBMS related

06.101 mgraph poll This is a periodic script that polls the gateway routers and creates bandwidth graphs. All the graphs (except the iDirect ones) are created and updated by this script and are accessible through m3ms, the customer portal and the reseller portal. If this script does not run at least once every 7 minutes we have gaps in the graphs! This has to be avoided! Action: Contact the systems department as quickly as possible!


06.102 idgraph poll This is the equivalent script to [06.101] for iDirect. It polls the iDirect hubs and creates graphs for each iDirect remote. If this script does not run at least once every 7 minutes we have gaps in the graphs! This has to be avoided! Action: Contact the systems department as quickly as possible
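The 7 minute rule in 06.101 and 06.102 amounts to a staleness check on the time of the last successful run. A minimal sketch, assuming the last run time is available as a timestamp (how the monitoring system actually obtains it is not described here):

```python
from datetime import datetime, timedelta

# Maximum gap between two poll runs before graphs show gaps (from the text).
MAX_POLL_INTERVAL = timedelta(minutes=7)


def poll_is_stale(last_run: datetime, now: datetime,
                  max_interval: timedelta = MAX_POLL_INTERVAL) -> bool:
    """True when the poller has not run within the allowed interval."""
    return now - last_run > max_interval
```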

06.103 idgraph discover This is a script running periodically. It polls the iDirect hubs looking for new/unknown iDirect remotes (modems) and, if any are found, updates the mbms database. If it doesn’t run for a while you get this error, and as a result newly commissioned links will not appear in m3ms. Action: Contact Systems during the next business hours

06.104 idgraph update This is a script running periodically. It checks the local iDirect remotes (modems) database, then polls the hub and looks for updates such as the iDirect name, Ethernet address and a few other things. If it doesn’t run we might get some outdated (non-critical) information in m3ms. Action: Contact Systems during the next business hours

06.105 mbms mail sender On many occasions mbms sends mails. One example is the ticketing system. All generated mails are first saved to a spool database and then sent out by a script. If the monitoring system detects that the spool database contains older data that has not been sent, this alarm will go on. Be aware that ticket replies are NOT SENT while this alarm is on. Action: Contact systems department

06.106 mgraph monitor This is the script that monitors all customer circuits defined in m3ms. It is supposed to run about every 5 minutes and tries to detect, using various methods, whether a circuit is up or down. It then marks the circuit as either UP or DOWN or, if it can’t monitor it, as UNMONITORED. If this script doesn’t run you will see outdated information regarding the status of customer circuits. Customers and resellers also have access to this information on the customer portals, and outdated data might confuse them. Action: Contact systems department
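The three circuit states in 06.106 can be modelled as an enumeration. The sketch below assumes that the detection methods are reduced to a single result that is either reachable, not reachable, or unknown; the real script and its detection methods are not shown here, so all names are hypothetical.

```python
from enum import Enum
from typing import Optional


class CircuitStatus(Enum):
    UP = "UP"
    DOWN = "DOWN"
    UNMONITORED = "UNMONITORED"


def classify_circuit(probe_result: Optional[bool]) -> CircuitStatus:
    """Map a probe result to a circuit status.

    True means the circuit answered, False means it did not, and None means
    the circuit could not be monitored at all.
    """
    if probe_result is None:
        return CircuitStatus.UNMONITORED
    return CircuitStatus.UP if probe_result else CircuitStatus.DOWN
```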

06.107 sugraph poll This script polls the Alvarion/Telrad FDD WiMAX base stations for the signal levels of the CPEs. It then creates the signal level graph, which also influences the signal level indicator and the color of the “dot” in the radio map. This information is for internal use only. If this script doesn’t run we will get gaps in the graphs and see outdated information in m3ms. Action: Contact Systems

06.108 sugraph 4xml This is the equivalent to [06.107] but for the Alvarion/Telrad 4Motion base stations. In this case the information will be extracted from a performance XML file which is created every 15min by the base station. If this script doesn’t run there will be no gaps in the graphs because it keeps file history. So it’s not critical. Action: Contact Systems

06.109 sugraph 4mnpu A periodic script that polls the NPU of the Alvarion/Telrad 4Motion base stations for a list of connected CPEs. It then performs various updates on the database, e.g. to which sector/base station a CPE is connected, discovery of new CPEs etc. If this script doesn’t run we might have outdated information in the database. Mostly uncritical. Action: Contact Systems

06.110 sugraph bsgraph This periodic script takes the signal levels of all CPEs, averages them per base station and creates a graph. This graph can be seen in m3ms under tools > wimax > base station statistics. If the script doesn’t run we might get gaps in the graph. The information is for internal use only. Action: Contact Systems

06.111 Postgresql Postgresql is the database we use for mbms. Everything links here; if Postgresql is down many things will not work: m3ms, customer/reseller portal, email service, DNS update, all the mbms scripts etc. DISASTER. Escalate immediately at any time. Action: Contact Systems immediately.

07 Infrastructure

07.001 Battery Voltage Power Generator The measured battery voltage of a generator is low. This can prevent the generator from starting on power cuts! Action: Contact Infrastructure to check

07.002 Maintenance Due The generator has been running for a long time and must go for maintenance. Action: Email Infrastructure. Make sure they reset the counter after maintenance has been performed. Then this alarm will disappear.

08 ITA Office

08.001 SIP Peer unreachable This is related to our PABX. It is interconnected to the provider using SIP trunks. Phones are also connected using SIP. The trigger monitors important SIP trunks that must not fail; for example, ‘Mundo Startel’ provides our main office telephone number. If that one is down we can’t place or receive calls! Contact: Systems Department, Severity: Depending
