intelligent monitoring
DESCRIPTION
This presentation describes an intelligent IT monitoring solution that uses Nagios as the source of information, Esper as the CEP engine, and a PCA algorithm.
TRANSCRIPT
![Page 1: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/1.jpg)
Intelligent Monitoring
Denis A. Vieira Jr.
Ricardo Clemente
![Page 2: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/2.jpg)
Intelligent Monitoring
Summary:
Motivation
Where are we?
Where are we going?
Action Plan
Event Correlation
![Page 3: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/3.jpg)
![Page 4: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/4.jpg)
Motivation:
Only point-in-time monitoring available
Reduce incident repair time
Proactive monitoring
Realistic view of the live environment
![Page 5: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/5.jpg)
Motivation:
Learn (identify patterns)
Automation
Store historical data with no loss
Improve credibility and situational awareness
![Page 6: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/6.jpg)
![Page 7: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/7.jpg)
Where are we?
Lots of information (1200 servers with more than 14000 monitors)
– more than 40000 graphs being plotted
Lots of monitoring tools running (SME, IPMonitor, Cricket, SiteScope, SiteSeer, Logs)
Difficulties with specific customizations, performance and cost
Alarms have little credibility (too many emails), but the situation is much better than before
![Page 8: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/8.jpg)
![Page 9: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/9.jpg)
Where are we going?
Use of events, e.g. appenders for log frameworks to integrate information from applications
Knowledge to anticipate undesired situations
Unified interface for monitoring
Root cause detection
![Page 10: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/10.jpg)
![Page 11: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/11.jpg)
Intelligent Monitoring
Action Plan:
Unify the monitoring tools with Nagios (scalability and integration)
Integrate Nagios with the correlation system using NEB (Nagios Event Broker)
available at: code.google.com/p/neb2activemq
Map events and systems to correlate (a manual and analytic task)
![Page 12: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/12.jpg)
Intelligent Monitoring
Summary:
Motivation
Where are we?
Where are we going?
Action Plan
Event Correlation
Overview and system architecture
Event Bus
Correlation technique
Correlation engine
Visualization
Machine Learning
Project
![Page 13: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/13.jpg)
Overview and system architecture
Modular and event-driven architecture
EVENT BUS
CORRELATION ENGINE
MACHINE LEARNING
COLLECTOR
VISUALIZATION
![Page 14: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/14.jpg)
Overview and system architecture
What is the system architecture?
Single bus for message exchange
Modules are separate operating system processes and can run on different machines
Modules can publish / subscribe to a queue / topic on the bus
Why an event-driven architecture?
Loosely coupled and distributed
Less intrusive for monitored systems
Modules are independent
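The publish / subscribe relationship between modules and the bus can be illustrated with a small in-process sketch. It is not the ActiveMQ API: the `EventBus` class and its method names are hypothetical, and the only behavior assumed is the usual one, where a topic broadcasts to every subscriber and a queue hands each message to a single consumer.

```python
from collections import defaultdict


class EventBus:
    """Toy in-process bus (hypothetical): topics broadcast, queues pick one consumer."""

    def __init__(self):
        self._subscribers = defaultdict(list)   # (channel_type, name) -> handlers
        self._next_consumer = defaultdict(int)  # round-robin cursor per queue

    def subscribe(self, channel_type, name, handler):
        if channel_type not in ("queue", "topic"):
            raise ValueError("channel_type must be 'queue' or 'topic'")
        self._subscribers[(channel_type, name)].append(handler)

    def publish(self, channel_type, name, event):
        """Deliver the event and return how many handlers received it."""
        key = (channel_type, name)
        handlers = self._subscribers.get(key, [])
        if not handlers:
            return 0  # nobody listening; a real broker would buffer queue messages
        if channel_type == "topic":
            for handler in handlers:
                handler(event)
            return len(handlers)
        # queue: exactly one consumer receives each message, in round-robin order
        index = self._next_consumer[key] % len(handlers)
        self._next_consumer[key] += 1
        handlers[index](event)
        return 1


bus = EventBus()
seen_by_engine, seen_by_ml = [], []
bus.subscribe("topic", "CPU", seen_by_engine.append)
bus.subscribe("topic", "CPU", seen_by_ml.append)

worker_a, worker_b = [], []
bus.subscribe("queue", "jobs", worker_a.append)
bus.subscribe("queue", "jobs", worker_b.append)

cpu_event = {"host": "ws122", "idle": 70}
bus.publish("topic", "CPU", cpu_event)  # both topic subscribers get it
bus.publish("queue", "jobs", 1)         # goes to worker_a
bus.publish("queue", "jobs", 2)         # goes to worker_b
```

Because publishers only know channel names, a module can be added or moved to another machine without touching the others, which is the loose coupling the slide refers to.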
![Page 15: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/15.jpg)
Event bus
Open source project
Chosen: Apache ActiveMQ
Stable
Performance
Active community
Connectivity
JMS
STOMP
REST
XMPP (...)
![Page 16: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/16.jpg)
Event Bus
Message format
JSON (not XML)
Simplicity
Structure
Header: channel type (queue or topic) and event type
Body: data
$ curl -d 'type=queue&body={"idle":70,"sys":20,"usr":10,"host":"ws122"}&eventtype=CPU' http://barramento/message/events
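As a minimal sketch of the message structure, the snippet below builds the same form-encoded payload in Python and parses it back. The field names `type`, `eventtype`, and `body` come from the curl example; the helper names `build_message` and `parse_message` are hypothetical, and nothing here is sent to the bus.

```python
import json
from urllib.parse import parse_qs, urlencode


def build_message(channel_type, event_type, data):
    """Return a form-encoded message: header fields plus a JSON body."""
    if channel_type not in ("queue", "topic"):
        raise ValueError("channel type must be 'queue' or 'topic'")
    return urlencode({
        "type": channel_type,       # header: channel type
        "eventtype": event_type,    # header: event type
        "body": json.dumps(data, sort_keys=True),  # body: the event data as JSON
    })


def parse_message(encoded):
    """Inverse of build_message: recover the header fields and the decoded body."""
    fields = parse_qs(encoded)
    return fields["type"][0], fields["eventtype"][0], json.loads(fields["body"][0])


message = build_message("queue", "CPU",
                        {"idle": 70, "sys": 20, "usr": 10, "host": "ws122"})
channel_type, event_type, data = parse_message(message)
```

Encoding the body with `urlencode` keeps characters such as `&` or `=` inside the JSON from being mistaken for form separators.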
![Page 17: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/17.jpg)
Correlation Technique
CEP (Complex Event Processing)
Technology that processes multiple events in real time in order to identify meaningful events
Based on rules or queries ("SQL-like")
Queries are created at run time
History
In 1995, Professor David Luckham of Stanford, working on the Rapide project, coined the term CEP
Database research topic: Data Stream Management Systems (DSMS)
![Page 18: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/18.jpg)
Correlation technique
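[Figure: two query-processing diagrams. In a traditional database, a query is processed in memory against persistent relations and returns an answer. In a data stream system, continuous data flows through query processing in memory and produces answers: an "upside-down database".]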
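The "upside-down database" idea can be sketched in a few lines: the queries are stored and the data flows past them. This is only an illustration, not how Esper is implemented; the `StreamEngine` class and the query name `low_idle` are hypothetical.

```python
class StreamEngine:
    """Keeps standing queries and evaluates every incoming event against them."""

    def __init__(self):
        self._queries = {}  # query name -> predicate over one event

    def register(self, name, predicate):
        # Queries can be added while the stream is running.
        self._queries[name] = predicate

    def push(self, event):
        """Return the names of all standing queries that the event satisfies."""
        return [name for name, predicate in self._queries.items() if predicate(event)]


engine = StreamEngine()
engine.register("low_idle", lambda e: e.get("type") == "CPU" and e["idle"] < 30)

first = engine.push({"type": "CPU", "host": "ws122", "idle": 70})
second = engine.push({"type": "CPU", "host": "ws122", "idle": 12})
```

Nothing is stored here except the queries themselves; an event that matches no query is simply dropped.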
![Page 19: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/19.jpg)
Correlation Technique
Marketing
Trend (buzz)
The CEP market is estimated at 460 million dollars by 2010 (source: IEEE Computer Society – April 2009)
Useful wherever there are data streams and a need to extract information from them in real time
Financial markets
Logistics processes (RFID)
Airport control
ICUs
Datacenters
![Page 20: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/20.jpg)
Correlation Technique
Big Players
![Page 21: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/21.jpg)
Correlation Technique
Open Source Players
Academic projects:
STREAM – Stanford – 2003 (officially deprecated)
TelegraphCQ – Berkeley – 2003
Based on PostgreSQL 7.3.2
No activity
Cayuga – Cornell
From the industry:
Esper, a Codehaus project that is feature-complete
Compact and flexible syntax
Excellent documentation
Performance
Our choice!
![Page 22: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/22.jpg)
Correlation Engine
Application
If sessions rose 10% in the last 3 minutes, the servers' average CPU did not rise 5%, and MySQL slow queries are above 10, then a database bottleneck is causing users to queue
![Page 23: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/23.jpg)
Correlation Engine
Application
[Figure: event timelines for Server, Vip, and Mysql with the series cpu_usr, session, and slow_query; two timelines span t – 3 min to t, and one shows only t.]
![Page 24: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/24.jpg)
Correlation Engine
Application
SELECT Server.host, Server.cpu_usr, Server_PAST.cpu_usr, Vip.session,
Vip_PAST.session, Mysql.slow_query
FROM
Server.win:time(1 min) as Server,
Server.win:ext_timed(current_timestamp(), 3 min) as Server_PAST,
Vip.win:time(1 min) as Vip,
Vip.win:ext_timed(current_timestamp(), 3 min) as Vip_PAST,
Mysql.win:time(1 min) as Mysql
HAVING
Vip.session > Vip_PAST.session * 1.10 AND
avg(Server.cpu_usr) < avg(Server_PAST.cpu_usr) * 1.05 AND
Mysql.slow_query > 10
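The three conditions in the HAVING clause can be read as a plain predicate. The sketch below restates them in Python over values that are assumed to have been collected already from the current and past windows; the function name and its parameters are hypothetical, and the windowing itself is left out.

```python
def database_bottleneck(session_now, session_past, cpu_now, cpu_past, slow_queries):
    """True when sessions grew by more than 10%, average CPU grew by less
    than 5%, and MySQL reports more than 10 slow queries."""
    avg_cpu_now = sum(cpu_now) / len(cpu_now)
    avg_cpu_past = sum(cpu_past) / len(cpu_past)
    return (
        session_now > session_past * 1.10
        and avg_cpu_now < avg_cpu_past * 1.05
        and slow_queries > 10
    )


# Sessions jumped from 1000 to 1200 while CPU stayed flat and slow queries piled up.
alert = database_bottleneck(1200, 1000, [41.0, 39.0], [40.0, 40.0], 15)

# Same session growth, but CPU rose too, so the rule does not blame the database.
no_alert = database_bottleneck(1200, 1000, [60.0, 62.0], [40.0, 40.0], 15)
```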
![Page 25: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/25.jpg)
Correlation Engine
Identifying an outlier
select host, free, avg(free)
from Memory.win:time(240 sec) group by host
having free < avg(free)
Event sequence
select * from
pattern [every Memory(free < 10) ->
(timer:interval(60 sec) and Log(text like '%OutOfMemory%'))]
Schedule and extensions
select idle from pattern [every timer:at(*, [16:22], *, [0,3], *)].win:time(30 sec),
CPU.win:time(30) where idle < 30 AND Filter.isInNode(id, "Sports.BigFarm")
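To make the outlier query more concrete, here is a rough Python equivalent: a 240-second sliding window per host, with an event flagged when its free memory is below that host's window average. The class name is hypothetical, and it is an assumption of this sketch that the current event counts toward the average.

```python
from collections import defaultdict, deque


class MemoryOutlierDetector:
    """Per-host sliding time window over Memory events (illustrative only)."""

    def __init__(self, window_seconds=240):
        self.window_seconds = window_seconds
        self._windows = defaultdict(deque)  # host -> deque of (timestamp, free)

    def observe(self, timestamp, host, free):
        """Add one event and return (is_outlier, window_average)."""
        window = self._windows[host]
        window.append((timestamp, free))
        # Drop events that have fallen out of the time window.
        while window[0][0] <= timestamp - self.window_seconds:
            window.popleft()
        average = sum(value for _, value in window) / len(window)
        return free < average, average


detector = MemoryOutlierDetector()
r1 = detector.observe(0, "ws122", 100)    # only event so far: not below average
r2 = detector.observe(60, "ws122", 100)
r3 = detector.observe(120, "ws122", 40)   # average is 80, so 40 is flagged
r4 = detector.observe(400, "ws122", 40)   # older events expired, window holds only this one
```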
![Page 26: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/26.jpg)
Correlation engine
Esper performance
Source: Esper Performance - http://docs.codehaus.org/display/ESPER/Esper+performance
Item: Specification
Esper server hardware: 2 x Intel Xeon 5130 2GHz (4 cores total), 16GB RAM
VM config: -Xms2g -Xmx2g -Xns128m -Xgc:gencon
Query: select '$' as ticker from Market(ticker='$').win:length(1000).stat:weighted_avg('price', 'volume') output last every 30 seconds
Number of queries: 1000
Events/s: 519 728
Latency: 99.66% < 10us
Average latency: 2.8us
Note: CPU at 85%, 70 Mbit/s
![Page 27: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/27.jpg)
Correlation engine
Process inside the correlation engine
![Page 28: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/28.jpg)
Visualization – Console
Querying the live environment
![Page 29: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/29.jpg)
Visualization – Troubleshooting
Anticipating and solving incidents more quickly
![Page 30: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/30.jpg)
Visualization – Dashboard
Consolidated view of the environment
![Page 31: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/31.jpg)
What about unseen problems?
![Page 32: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/32.jpg)
Machine Learning
Unsupervised, incremental algorithms were chosen
Incremental PCA
Transforms a number of possibly correlated variables into a smaller number of uncorrelated variables, the principal components
A change in the principal components means a broken correlation, or an anomaly
Can be used for data compression
Inspired by a paper from Carnegie Mellon University (Hoke et al. 2006)
Source: http://www.pdl.cmu.edu/PDL-FTP/SelfStar/osr_sub.pdf
The implementation had two main challenges: measurements with missing values and different scales
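The idea that a broken correlation shows up as an anomaly can be sketched with a one-component incremental PCA. The update below is an Oja-style rule with a forgetting factor, chosen for illustration; it is not claimed to be the presenters' implementation or the exact algorithm of the cited paper, and it ignores missing values and scaling. The class name, the forgetting factor of 0.96, and the test signals are all assumptions.

```python
import math


class IncrementalPCA1:
    """Tracks the first principal direction of a stream, one sample at a time."""

    def __init__(self, n_signals, forgetting=0.96):
        self.w = [1.0] + [0.0] * (n_signals - 1)  # current direction, unit length
        self.energy = 1e-3                         # discounted energy of projections
        self.forgetting = forgetting

    def project(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def residual(self, x):
        """Distance between x and its reconstruction from the single component."""
        y = self.project(x)
        return math.sqrt(sum((xi - y * wi) ** 2 for wi, xi in zip(self.w, x)))

    def update(self, x):
        y = self.project(x)
        self.energy = self.forgetting * self.energy + y * y
        error = [xi - y * wi for wi, xi in zip(self.w, x)]
        # Move the direction toward the part of x it failed to explain.
        self.w = [wi + (y / self.energy) * ei for wi, ei in zip(self.w, error)]
        norm = math.sqrt(sum(wi * wi for wi in self.w))
        self.w = [wi / norm for wi in self.w]
        return y


# Two synthetic signals where the second is always twice the first.
model = IncrementalPCA1(n_signals=2)
for t in range(200):
    a = 1.0 + 0.5 * math.sin(0.1 * t)
    model.update([a, 2.0 * a])

normal_residual = model.residual([1.2, 2.4])   # follows the learned correlation
broken_residual = model.residual([1.2, -2.4])  # correlation broken
```

A sample that respects the learned relationship is reconstructed almost exactly, while one that breaks it leaves a large residual, which is the signal an anomaly detector would threshold.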
![Page 33: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/33.jpg)
Machine Learning
60 input signals
![Page 34: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/34.jpg)
Machine Learning
Summarized in 1 principal component + generation matrix
![Page 35: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/35.jpg)
Machine Learning
Second principal component
Sensitivity
Three anomalies
![Page 36: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/36.jpg)
Project
Status
All functionality developed
Algorithms being validated through tests with RRDs and meetings with the operations team
Performance tests ongoing
System running in the live environment with reduced scope
![Page 37: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/37.jpg)
Project at Globo.com – Next challenges
Scale
Event "sharding"
Rule balancing
Cache
Optimize the algorithm
Adaptive control of memory and sensitivity parameters
Add a supervised layer
Other algorithms to cooperate
![Page 38: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/38.jpg)
Intelligent Monitoring
Final considerations
![Page 39: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/39.jpg)
References
http://delicious.com/fisl10
![Page 40: Intelligent Monitoring](https://reader034.vdocuments.site/reader034/viewer/2022051411/547a557fb4af9fb4158b4a83/html5/thumbnails/40.jpg)
Questions
Contacts
Denis A. Vieira Jr
[email protected] (www.globo.com)
Ricardo Clemente
[email protected] (www.intelie.com.br)
Globo.com stand
This afternoon
Raise your hand!