Connect. Communicate. Collaborate
GÉANT2 monitoring
Otto Kreiter, DANTE
Navneet Daga, DANTE
LHC Monitoring Workshop, Munich, 19.07.2006
Agenda
• Extraction of monitoring information from the GÉANT2 network
• External application developed by DANTE
• Demonstration of a home-grown weather-map
• Conclusion
Network Element Manager (NM)
• All network elements (NEs) communicate with the NM separately
• The NM's task is to configure and monitor each NE one by one
• It is not service aware: it has no knowledge of the intra-domain e2e path status
Regional Network Manager (RM)
[Diagram labels: Topology, Services, Correlation, “User” interface]
How we export data
[Diagram labels: Alarms, Perf. Meas., Rem. Inv.]
Status via alarms
[Diagram labels: Alarms, SNMPTrapD, Monitoring station]
Alarm content
• From the NM:
– Information about interfaces and associated signal status, SDH timing problems
– NE and ILA status
• From the RM:
– Information related to services
– Information related to paths, trails and physical connections at all layers
One hop case: NMS vs JRA-4
Path – gen_mil_CERN
OCH trail
Phys link, Phys link
Domain link
P. ID Link, P. ID Link
BOL-CERN-LHC-001
Multiple hop case: NMS vs JRA-4
Path – gen_mil_CERN
OCH trail
Phys link, Phys link
Domain link, P. ID Link
CERN-SARA-LHC-001
OCH trail
Phys link
P. ID Link
Alarm processing
• SNMP traps from the Alcatel IOO module
• Alcatel Enterprise v1/v2c MIB
• SNMP traps received by a Linux station
– snmptrapd picks up all alarms
– For each trap a bash script is called which performs:
• Analysis
• Selection
• Action
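The analysis step can be illustrated with a minimal sketch. The slides state that the real handler is a bash script; Python is used here only for illustration. The sketch assumes the standard snmptrapd traphandle input (first line hostname, second line transport address, then one "OID value" pair per line) and assumes the MIB is loaded so that varbind names are textual. The MIB name EXAMPLE-MIB, the host name and the sample values are hypothetical.

```python
# Sketch only: parse the text that snmptrapd passes to a traphandle program.
# Assumption: textual varbind names such as "EXAMPLE-MIB::friendlyName.0".

def parse_trap(lines):
    """Return hostname, address and a dict of varbinds from traphandle input."""
    lines = [ln.rstrip("\n") for ln in lines if ln.strip()]
    if len(lines) < 2:
        raise ValueError("expected at least a hostname and an address line")
    trap = {"hostname": lines[0], "address": lines[1], "varbinds": {}}
    for line in lines[2:]:
        oid, _, value = line.partition(" ")
        # Keep the object name without MIB prefix and instance suffix;
        # fall back to the raw OID if no textual name is present.
        short = oid.split("::")[-1].split(".")[0] or oid
        trap["varbinds"][short] = value.strip().strip('"')
    return trap


# Hypothetical sample input, for illustration only.
SAMPLE = """nms.example.net
UDP: [192.0.2.10]:162->[192.0.2.20]:162
EXAMPLE-MIB::friendlyName.0 "gen_mil_CERN"
EXAMPLE-MIB::perceivedSeverity.0 major
"""

trap = parse_trap(SAMPLE.splitlines())
print(trap["varbinds"])
```

In a real deployment the handler would read the same lines from standard input (for example `parse_trap(sys.stdin)`); selection and action would then operate on the returned dictionary.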
Alarm type & information
Alarm Raise:
– friendlyName
– probableCause
– perceivedSeverity
– currentAlarmId
– eventTime
– acknowledgementStatus
– additionalInformation
– eventType
– snmpTrapAddress
Alarm Clear:
– friendlyName
– probableCause
– currentAlarmId
– eventTime
– snmpTrapAddress
Used alarm information
Alarm Raise:
– friendlyName
– probableCause
– perceivedSeverity
– currentAlarmId
– eventTime
– acknowledgementStatus
– additionalInformation
– eventType
– snmpTrapAddress
Alarm Clear:
– friendlyName
– probableCause
– currentAlarmId
– eventTime
– snmpTrapAddress
Alarm analyzer process
[Flowchart, reconstructed from its labels]
• SNMP trap received; all traps are recorded
• snmpTrapAddress must be registered
• Check for type of alarm
– Raise: check Additional Info (path, clientpath, ochtrail, omstrail, physicallink) → record alarm → call external program (friendlyName)
– Clear: alarmID → read recorded alarm → call external program (friendlyName) → delete recorded alarm
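The raise and clear branches of this flow can be sketched as follows. This is not the authors' bash script; it is an illustration under several assumptions: the trap is already parsed into a dictionary, the key "kind" (raise or clear) is a hypothetical way of marking the alarm type, the Additional Info check is a simple substring match, and `notify` stands in for the external program that receives a name and a status.

```python
# Sketch only: raise/clear handling with a record of open alarms.
# "kind" and the substring match on additionalInformation are assumptions.

RELEVANT_KINDS = ("path", "clientpath", "ochtrail", "omstrail", "physicallink")


def handle_trap(trap, records, registered_sources, notify):
    """Process one trap; `records` maps currentAlarmId -> friendlyName."""
    if trap.get("snmpTrapAddress") not in registered_sources:
        return "ignored"  # sender is not a registered source
    alarm_id = trap.get("currentAlarmId")
    if trap.get("kind") == "raise":
        info = trap.get("additionalInformation", "")
        if not any(kind in info for kind in RELEVANT_KINDS):
            return "ignored"  # alarm does not concern a monitored object type
        records[alarm_id] = trap["friendlyName"]  # record the alarm
        notify(trap["friendlyName"], "DOWN")      # call the external program
        return "raised"
    # Clear: look up the recorded alarm by its ID and delete the record.
    name = records.pop(alarm_id, None)
    if name is None:
        return "ignored"  # no matching raise was recorded
    notify(name, "UP")
    return "cleared"
```

A clear whose raise was never recorded is ignored here; that is a design choice of the sketch, and it is one of the cases where the monitoring station can drift out of sync with the NMS.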
Alarm analyzer
• Called every time a trap is received
• Written in bash
• Each trap is analyzed separately; if a new trap arrives in the meantime, it waits in the snmptrapd queue
– Possible problem: if an external program gets stuck, the script hangs and the alarms remain unprocessed in the queue
• Must maintain state
– SNMP traps may get lost, so a program needs to check from time to time whether the monitoring station is in sync with the NMS
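The periodic synchronisation check can be sketched as a comparison of two alarm tables. How the list of currently active alarms is obtained from the NMS is not described in the slides, so `nms_active` below is simply assumed to be available; both arguments are hypothetical dictionaries mapping currentAlarmId to friendlyName.

```python
# Sketch only: detect lost traps by comparing the local alarm record
# with the set of alarms the NMS reports as active (assumed input).

def reconcile(local_records, nms_active):
    """Return (missed_raises, missed_clears) as dicts of id -> friendlyName."""
    # Active on the NMS but unknown locally: a raise trap was lost.
    missed_raises = {i: n for i, n in nms_active.items() if i not in local_records}
    # Recorded locally but no longer active on the NMS: a clear trap was lost.
    missed_clears = {i: n for i, n in local_records.items() if i not in nms_active}
    return missed_raises, missed_clears


local = {"42": "gen_mil_CERN"}
remote = {"57": "CERN-SARA-LHC-001"}
raises, clears = reconcile(local, remote)
print(raises, clears)
```

The two result sets could then be replayed through the normal raise and clear handling so that the exported status converges on the NMS view.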
XML file generation
E2E Data transformation
• Prototype applications developed in Java:
– E2EXMLWriter
– XMLGenerator
• E2EXMLWriter performs 2 functions:
– Takes in a template XML and produces an XML file containing live e2e path status information conforming to the JRA4 e2e data model
– Feeds a perfSONAR MA with live path status information
• E2EXMLWriter is triggered by a script listening to SNMP alarms
– Parameters passed:
• Trail ID
• Status
• XMLGenerator produces the template XML that E2EXMLWriter uses to export the domain's e2e information
Design of E2EXMLWriter
• Relies on 2 configuration files to produce live XML status information
– Properties file (links.properties)
• Contains key = value entries
• Each key is one e2e path name
• The value of each key is a CSV list of the trails that form one Domain Link and/or Partial ID Link
• Currently manually maintained
– Alarm register
• A simple CSV file
• Application maintained
• An “alarm raise” registers the associated path
• An “alarm clear” de-registers the associated path
(contd.)
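The two configuration inputs can be illustrated with a short sketch. E2EXMLWriter itself is a Java application; this Python fragment only mirrors the described data layout. The trail names trail_a, trail_b and trail_c are hypothetical, and keeping the alarm register as an in-memory set is an assumption (the slides only say that it is a simple CSV file).

```python
# Sketch only: read "key = value" entries whose value is a comma-separated
# list of trails, and maintain a register of paths with an active alarm.

def load_links(text):
    """Map each e2e path name to the list of trails that form it."""
    links = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        links[key.strip()] = [t.strip() for t in value.split(",") if t.strip()]
    return links


def update_register(registered, path, event):
    """An alarm raise registers the path; an alarm clear de-registers it."""
    if event == "raise":
        registered.add(path)
    elif event == "clear":
        registered.discard(path)
    return registered


# Hypothetical file content, for illustration only.
LINKS_TEXT = """
CERN-SARA-LHC-001 = trail_a, trail_b
BOL-CERN-LHC-001 = trail_c
"""

links = load_links(LINKS_TEXT)
registered = update_register(set(), "CERN-SARA-LHC-001", "raise")
print(links, ",".join(sorted(registered)))
```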
Design (contd.)
• The application sets every path's default status to UP with admin state NORMALOPERATION
• Only the paths “registered” in the alarm-register CSV file are set to DOWN with admin state MAINTENANCE
• The status DEGRADED is not implemented at the moment
• Other admin states are not implemented at the moment
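The status rule on this slide is small enough to state as code. A minimal sketch, assuming the path names and the set of registered paths are already loaded; the function name `path_states` is hypothetical.

```python
# Sketch only: every path defaults to UP / NORMALOPERATION; paths found in
# the alarm register become DOWN / MAINTENANCE. DEGRADED is not modelled.

def path_states(path_names, registered):
    """Return {path: (status, admin_state)} for all known paths."""
    return {
        name: ("DOWN", "MAINTENANCE") if name in registered
        else ("UP", "NORMALOPERATION")
        for name in path_names
    }


states = path_states(["CERN-SARA-LHC-001", "BOL-CERN-LHC-001"], {"BOL-CERN-LHC-001"})
print(states)
```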
Design of XMLGenerator
• Relies on 3 configuration files:
– Properties file (init.properties)
• Contains a key = value entry
• Key = DOMAIN
• Value = <domain_name>
• Enables on-the-fly domain name configuration
– Config file (config.csv)
• A simple CSV file
• Contains node-link-node information
– A sample XML file containing “pieces of XML” to be replicated for each node and link in the final output “template XML”
• All configuration files are currently manually maintained
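How node-link-node rows might be turned into a template can be sketched as below. This is not the XMLGenerator implementation (which is written in Java and replicates pieces of a sample XML file); the sketch builds elements directly instead. The element and attribute names are hypothetical and do not follow the JRA4 e2e data model, and the node and link identifiers in the sample are made up.

```python
# Sketch only: build a template XML document from node-link-node CSV rows.
# Element names ("domain", "node", "link") are illustrative assumptions.
import csv
import io
import xml.etree.ElementTree as ET


def build_template(domain, config_csv):
    """Emit one <node> per distinct node and one <link> per CSV row."""
    root = ET.Element("domain", {"name": domain})
    seen = set()
    for row in csv.reader(io.StringIO(config_csv)):
        if len(row) != 3:
            continue  # skip blank or malformed rows
        node_a, link, node_b = (cell.strip() for cell in row)
        for node in (node_a, node_b):
            if node not in seen:
                seen.add(node)
                ET.SubElement(root, "node", {"id": node})
        ET.SubElement(root, "link", {"id": link, "from": node_a, "to": node_b})
    return ET.tostring(root, encoding="unicode")


# Hypothetical configuration, for illustration only.
CONFIG = "GEN,link_1,MIL\nMIL,link_2,FRA\n"
print(build_template("GEANT2", CONFIG))
```

The domain name is passed in as a parameter here, mirroring the DOMAIN entry in init.properties.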
Monitoring data processing “e2e path”
LHC weather-map live demonstration
1. CERN user-side down
2. CERN user-side up
3. GEN-MIL lambda down
4. GARR user-side down
5. Back-to-back interconnection in DE broken
6. AMS-FRA lambda down
7. DE interconnection up
8. AMS-FRA lambda up
9. GARR user-side up
10. GEN-MIL lambda up
Conclusion
• Status monitoring via SNMP alarms is in an advanced phase and well understood
– Once the characteristics of the equipment, alarms and faults were understood, the development was easy
• XMLGenerator is not bound to specific equipment and can be used together with the JRA-4 MP and/or to feed a perfSONAR MA
T0-T1 CERN-CNAF
[Diagram labels: CERN (CH), GÉANT2, GARR, CNAF (IT)]
Technologies
CERN-CNAF-LHCOPN-001
GÉANT2, GARR, CNAF, CERN
Domain link, P. ID Link, P. ID Link, P. ID Link, P. ID Link, Domain link
F10, 1626 LM, 1626 LM, M320, M320, C6509
Domain I – CERN
• Partial ID Link corresponds to the status of the port
• MP developed by Martin Swany exports port status information
CERN
P. ID Link
F10
Domain II – GÉANT2
• Partial ID Link – status of the ports facing the adjacent domains
• Domain Link – status of the lambda
• perfSONAR MA and GN2-JRA4 MP used to export status information
GÉANT2
Domain link, P. ID Link, P. ID Link
1626 LM
Domain III – GARR
• Inter Domain Link – status of the port facing GÉANT2
• Domain link – status of the LSP between the two routers + status of the interface facing CNAF (T1)
• GN2-JRA4 MP used to export measurement data
GARR
P. ID Link, Domain link
View on the E2E monitoring system
Conclusion
• Fairly easy to establish the monitoring of the E2E path
– It took around two phone conferences with GARR + around 10 e-mails
– 3-4 phone conferences with CERN and Martin Swany + around 10-15 e-mails
– All parties were extremely familiar with their equipment and the required software
• Questions started to pop up: do we need to monitor an End-Point (EP), and how should we do it?
– Is an EP a simple client?
– Or shall we redefine the “Client” as somebody who actively participates in the e2e monitoring?
Backup
CERN user side down
Lambda CH-IT down
Lambda and user failure in IT
Lambda + POP interconnect failure
Multiple Lambda, user and POP interconnect failure