ITIL® v3 Event Management – A Look at the Theory (from the Real World)

Brenda L. Peery, 14th September 2009, BCS Specialist Group Session

All copyrights acknowledged. ITIL® is a Registered Trade Mark of the Office of Government Commerce, and is Registered in the U.S. Patent and Trademark Office



An ‘event’

Not here for the tents and soundstage?

What is worth taking from that as we go forward to look at our idea of Event Management?

• It looks like it could be a bit muddy

– Very broad definition

– Obscure language

• But there is an idea of purpose … [from ‘Event Management’, Wikipedia]:

“to market themselves, build business relationships, raise money or celebrate”

Speaker’s Background

• 15+ years' experience with IT service related projects and roles, on both the vendor and user sides, with Event & Systems Management related work

• ITIL v2 Manager, v3 Expert, MSP and Prince2 Practitioner, ITIL instructor, APMG committee member developing new ITIL credentials

• As an independent consultant for the last 5 years, “IT Service Management Architect” is my favourite title thus far …

Main Topics / Goals

• Event Management – you may already know it and have it

– Monitoring and Event Management (key relationship)

• Event Management – the Basics according to ITIL®

• Where EM fits & what to consider in doing it

– First ask why – strategy

– Planning and managing

• Evaluation of the need

• What are you trying to solve / what need are you trying to serve?

• Define a model and develop a strategy

Initial Context – Familiarity?

• Event Management (EM) as a core process is new with v3 ITIL with some roots in v2

• What elements are familiar?

© Crown copyright. Reproduced from the OGC's ITIL® version 2 volume: ICT Infrastructure Management and version 3 Core volume Service Operation. All rights acknowledged.

Initial Context – Monitoring?

• Almost everyone has some familiarity with “Monitoring”

• Consider monitoring and management over the last decade:

– Systems Management software tools: IBM Tivoli (particularly TEC), CA NSM, BMC Patrol

– The reporting capability of underlying Operating Systems: log files and system utilities, Task Manager in Windows, the “top” command in Unix

– And never underestimate the diagnostic scripts that your SysAdmins have written or inherited

(Illus.) Ops Bridge Monitoring

Monitoring

Other kinds of monitoring?

• Other IT?

• Other sector?

• Inventory?

• Business monitoring?

• Projects to bring in & Manage that monitoring

• Why do we do it?

Initial Context – History

So even though Event Management is ‘new’, there are some challenges in creating a process model that arise from the back history that comes along with your infrastructure:

• There may already be strategies in place and benefits being realised from monitoring programmes

• There are likely to be ‘competing understandings’ of:

– what events are

– what you are or are not doing about them, and

– at what levels you are engaging to monitor and utilise them

• Stakeholders may range from in-depth technical all the way up to non-technical consumers of the information EM can produce

Your back history, embedded in your kit, will shape or constrain your EM possibilities

Best Practice Benefits

Develop a shared understanding and common language based on best practice recommendations, at least as your starting point …

EM Basics 1 – EM Process

“Event Management is the process that monitors all events that occur through the IT infrastructure to allow for normal operation and also to detect and escalate exception conditions” (SO p.35).

So it is about:

– Detecting events

– Making sense of them

– Determining appropriate control actions in response to them

But also:

– Acting as a basis for automating routine Operations Management, and

– Because it provides data for comparison, supporting:

• Service Assurance and Reporting

• Service Improvement

Event Management – Value

“Generally indirect” (SO p.39)

• EM provides mechanisms for early detection of incidents (possibly action before any impact felt)

• EM provides a basis for automated operations

• EM provides a basis for monitoring automated activity by exception

– Reducing the need for “expensive and resource intensive real-time monitoring while reducing downtime”

• Improves performance of other major Processes (early responses, more business benefit from more effective and efficient ITSM)

EM Basics 2 – Event Definition

What is an Event?

Any detectable or discernible occurrence that has significance for the management of the IT infrastructure or the delivery of IT service and the evaluation of the impact a deviation may cause to a service.

Events are typically notifications created by an IT service, Configuration Item (CI) or monitoring tool.

(SO, p.35-36)

EM Basics 3 – Event Definition (Breadth)

Checking the official scope doesn’t narrow it down much:

“Event Management can be applied to any aspect of Service Management that needs to be controlled and which can be automated” (SO p.36).

EM Basics 4 – Event Type

But there is more detail – the guidance suggests that you subdivide Events and “that at least these three broad categories be represented” in your Event Types:

1. Informational

• There is no action required

• Signifies regular operation (not an exception)

2. Warning

• Approaching a threshold

• Signifies unusual, but not exceptional, operation

3. Exception

• Abnormal operation. Breach of parameters.

Note also: Alert (to trigger human attention or intervention) [SO, p.40]
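
The three categories above can be illustrated as a simple threshold check. This is only a minimal sketch in Python, not ITIL guidance: the `EventType` enum, the `classify` function and the 80/95 percent thresholds are assumptions made for the example.

```python
from enum import Enum


class EventType(Enum):
    INFORMATIONAL = "informational"  # regular operation, no action required
    WARNING = "warning"              # approaching a threshold
    EXCEPTION = "exception"          # abnormal operation, parameters breached


def classify(value: float, warn_at: float, breach_at: float) -> EventType:
    """Map a measured value onto one of the three broad event types.

    warn_at and breach_at are hypothetical thresholds chosen by the designer.
    """
    if warn_at > breach_at:
        raise ValueError("warn_at must not exceed breach_at")
    if value >= breach_at:
        return EventType.EXCEPTION
    if value >= warn_at:
        return EventType.WARNING
    return EventType.INFORMATIONAL


# Example: disk usage in percent, with assumed thresholds of 80 and 95.
for usage in (42.0, 87.5, 97.0):
    print(usage, classify(usage, warn_at=80.0, breach_at=95.0).value)
```

In practice the thresholds would come from Service Design decisions for each type of CI rather than from constants in a script.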

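EM Basics 5 – Process Flowchart

(Flowchart: Event → Event Notification Generated → Event Detected → Event Filtered → Significance? The Significance? decision branches into Informational, Warning and Exception. Later steps are Event Correlation and Trigger; the responses Event Logged, Auto Response and Alert (leading to Human Intervention); and a Type? decision routing to Incident Management, Problem Management or Change Management. The flow ends with Review Actions → Effective? (Yes / No) → Close Event → End.)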

© Crown copyright 2008. Reproduced from the OGC's ITIL® core volume: Service Operation. All rights acknowledged.

EM Basics 6 – Process Activities Summary

• Event Occurs – Notification / Detection

• Filtering (Categorisation)

• Correlation (Logic/rules). Note: Load

• Trigger / Response Selection. Note: Human Perception

• Review / Close
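
The activities above can be sketched as a small pipeline. This is a simplified illustration under stated assumptions, not the Service Operation process itself: the `Event` and `EventManager` classes, the `ignored_sources` filter and the "alert after three warnings from one source" correlation rule are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Event:
    source: str    # CI or monitoring tool that sent the notification
    kind: str      # "informational", "warning" or "exception"
    message: str


@dataclass
class EventManager:
    ignored_sources: set = field(default_factory=set)   # hypothetical filter rule
    alert_after: int = 3                                 # hypothetical correlation rule
    log: list = field(default_factory=list)
    warning_counts: Counter = field(default_factory=Counter)

    def handle(self, event: Event) -> str:
        """Return the response selected for one event."""
        # Filtering: decide whether the event is passed on at all.
        if event.source in self.ignored_sources:
            return "ignored"
        # Everything that passes the filter is logged for later review.
        self.log.append(event)
        if event.kind == "informational":
            return "logged"
        if event.kind == "exception":
            return "incident"    # hand over to Incident Management
        if event.kind == "warning":
            # Correlation: only repeated warnings from one source raise an alert.
            self.warning_counts[event.source] += 1
            if self.warning_counts[event.source] >= self.alert_after:
                self.warning_counts[event.source] = 0
                return "alert"   # trigger human attention
            return "logged"
        raise ValueError(f"unknown event kind: {event.kind!r}")


manager = EventManager(ignored_sources={"test01"})
outcomes = [
    manager.handle(Event("db01", "warning", "tablespace filling up"))
    for _ in range(3)
]
print(outcomes)
```

Keeping the correlation rule on the manager rather than on each monitor reflects the "Load" note: the more logic runs centrally, the more events the central component has to process.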

EM Basics 7 – Events and Infrastructure

Consider the extent to which your process design is and must be connected to your installed architecture

– Notification/Detection: How are you detecting and how are notifications sent or collected (and what impact does this have)?

– Filtering/Categorising: sort events into Informational (I), Warning (W) and Exception (E) streams, or ignore the event (or log/record it locally)

– Triggering an Alert, Auto Response, or related Process (does your architecture allow this?)

EM – Lifecycle & Summary

(Figure: the service lifecycle – Service Strategy, Service Design, Service Transition, Service Operation, Continual Service Improvement (CSI))

In the Lifecycle concept that is at the heart of v3 ITIL, the Event Management process is seated in Service Operation with the full set of SO processes including:

– Event Management

– Incident Management

– Request Fulfilment

– Problem Management

– Access Management

– Operational aspects of other Processes

The EM & Monitoring Relationship

If we revisit the basic definition:

“Event Management is the process that monitors all events that occur through the IT infrastructure to allow for normal operation and also to detect and escalate exception conditions” (SO p.35).

While the Service Operation book provides a high level model of a ‘sample’ EM process, have we really looked at its key activity sufficiently ...

Designing EM – Alternate Lifecycle

(Figure: the service lifecycle – Service Strategy, Service Design, Service Transition, Service Operation, CSI)

“In an ideal world, the Service Design process should define which events need to be generated and then specify how this can be done for each type of CI. During Service Transition, the event generation options would be set and tested”. (SO p.39)

Monitoring and Infrastructure

The base monitoring architecture:

• Agent based

• Agent less

• A sample of an evolved monitoring architecture

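Agent based

(Diagram: a monitored server (Windows/Unix) runs an agent that watches the application log, the system error log, the disks and Processes 1 to 3 (up? memory? CPU?) and can run a script or command. The agent keeps a stored configuration, received once, and sends alerts and metrics via a hub / gateway to the monitoring server. The monitoring server holds the configuration and the history of alerts and metrics, and is viewed through a GUI console.)

Advantages:

* Technically more efficient

* Possible offline operation

* Often richer in functionality

Disadvantages:

* More complicated to install

* Agent disk footprint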

Agent-less

(Diagram: the monitoring server holds the configuration, the schedules and the history of alerts and metrics, with a web server and web console. A cross-machine scheduling loop opens a new connection to the monitored server (Windows/Unix) every cycle and uses remote execution to rescan the application log file, rescan the system error log file, check the disks, relist and filter the processes (Process 1 to 3), and run scripts or commands. Alerts and metrics flow back to the monitoring server.)

Advantages:

* No agent to install -> easy to install

* No agent footprint

Disadvantages:

* More load on monitored machine

* Less resilient to network problems
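
The trade-off between the two base architectures can be made concrete with a toy simulation. This is only a sketch under stated assumptions: the `Agent` class, the `agentless_poll` function, the 95 percent threshold and the sample readings are all hypothetical, and no real network access takes place.

```python
from typing import List, Optional


class Agent:
    """Runs on the monitored server and evaluates readings locally."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold
        self.buffer: List[str] = []   # alerts waiting to be sent

    def cycle(self, reading: float, network_up: bool) -> List[str]:
        """Check one reading; return the alerts delivered to the server this cycle."""
        if reading >= self.threshold:
            self.buffer.append(f"reading {reading} breached {self.threshold}")
        if not network_up:
            return []                 # keep buffering while offline
        delivered, self.buffer = self.buffer, []
        return delivered


def agentless_poll(reading: float, threshold: float,
                   network_up: bool) -> Optional[List[str]]:
    """The monitoring server connects to the host and checks it remotely.

    Returns None when the host cannot be reached, so that cycle's reading
    is never seen by the monitoring server.
    """
    if not network_up:
        return None
    if reading >= threshold:
        return [f"reading {reading} breached {threshold}"]
    return []


readings = [96.0, 50.0, 97.0]     # e.g. disk usage in percent per cycle
network = [False, False, True]    # network is down for the first two cycles

agent = Agent(threshold=95.0)
agent_alerts = [a for r, up in zip(readings, network) for a in agent.cycle(r, up)]
polled = [agentless_poll(r, 95.0, up) for r, up in zip(readings, network)]
print(len(agent_alerts), polled)
```

In this run the agent delivers both breaches once the network returns, while the agent-less poll only ever sees the last one, which illustrates "possible offline operation" versus "less resilient to network problems".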

Design Considerations – Starting Systems

(Diagram: five servers, each with its own CPU, disk, memory and log checks. A Unix database server runs Oracle1 and Oracle2 with CRON jobs; a second Unix database server runs Sybase1 and Sybase2 with CRON jobs; a Windows database server runs MSS1 and MSS2. A Unix application server and a Windows application server host the application processes, labelled App 1 Proc 1, App 1 Proc 2, App 1 Proc 3, App 2 Proc 1 and App 1 Proc 2.)

Design Considerations – System Capacity

(Diagram: the same five servers as the starting systems: two Unix database servers (Oracle1/Oracle2 and Sybase1/Sybase2, each with CRON), a Windows database server (MSS1/MSS2), and Unix and Windows application servers with their application processes and CPU, disk, memory and log checks. Each server now also carries a capacity collector (“Cap”), feeding an open source capacity tool with an in-house GUI.)

Design Considerations – DB Mon. Capacity

(Diagram: the same five servers, with database monitoring added. Each database server (Oracle1/Oracle2 with CRON, Sybase1/Sybase2 with CRON, MSS1/MSS2) now runs a DBMon component. These feed Database Monitoring, which supports database capacity planning and web reports. The Unix and Windows application servers keep their application processes and CPU, disk, memory and log checks.)

Design Considerations – App. Log Check

(Diagram: the same five servers, each now running an agent alongside its CPU, disk, memory and log checks. The agents report to a monitoring server that holds the thresholds and the app-specific monitoring configuration.)

ESM Arch (Generic)

(Diagram: a generic enterprise systems management architecture. Event sources are Network Monitoring, Database Monitoring (DBMon on the two Unix database servers and the Windows database server, alongside CRON), the agents on the database and application servers reporting to the monitoring server with thresholds and app-specific monitoring configuration, capacity collection (“Cap”), and additional departmental, application-specific monitoring. All of them send events to a Central Event Server, where rules are applied. The Central Event Server feeds a Live Outage Report and raises tickets with event details in the Incident Management System.)
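
One thing the rules on a central event server can do is fold events from several monitors into a single ticket. The sketch below assumes a very simple rule, grouping by host within a time window; the `CentralEventServer` and `Ticket` classes, the 300 second window and the sample events are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Ticket:
    host: str
    details: List[str] = field(default_factory=list)   # event details on the ticket


class CentralEventServer:
    """Collects events from all monitors and applies one correlation rule."""

    def __init__(self, window_seconds: float = 300.0) -> None:
        self.window = window_seconds
        self.open_tickets: Dict[str, Ticket] = {}
        self.last_seen: Dict[str, float] = {}

    def receive(self, timestamp: float, host: str, source: str, message: str) -> Ticket:
        """Attach the event to the host's current ticket, or open a new one."""
        ticket = self.open_tickets.get(host)
        expired = ticket is not None and timestamp - self.last_seen[host] > self.window
        if ticket is None or expired:
            ticket = Ticket(host)
            self.open_tickets[host] = ticket
        ticket.details.append(f"[{source}] {message}")
        self.last_seen[host] = timestamp
        return ticket

    def live_outage_report(self) -> List[str]:
        """Hosts that currently have a ticket, in alphabetical order."""
        return sorted(self.open_tickets)


server = CentralEventServer(window_seconds=300.0)
first = server.receive(0.0, "db01", "DBMon", "Oracle1 not responding")
second = server.receive(30.0, "db01", "Network Monitoring", "ping timeout")
later = server.receive(1000.0, "db01", "Agent", "disk 96% full")
print(first.details, server.live_outage_report())
```

Here the database and network events land on one ticket because they arrive within the window, while the later disk event opens a new one. A real rule set would also need to close tickets and handle relationships between hosts.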

The Two Perspectives

Operations led and Design led

– Operations led delivers the everyday working process

– Operations led vision is really pre-Incident Incident management

– Design led establishes a conduit between IT Service Management and the underlying technology

– Design led has the potential to be a very effective front end and interface for traditionally less visible processes:

• Performance & Management Information (dashboards)

• Capacity

• Availability

Strategy – What it Takes to Do EM

(Figure: the service lifecycle – Service Strategy, Service Design, Service Transition, Service Operation, CSI)

Start in the center ... First ask “Why?”