Post on 21-Dec-2015
Visible Ops: Building Effective & Auditable ITIL Change Management Processes in 4 Steps: Phase One
Gene Kim, CTO, Tripwire, Inc.
October 27, 2004
The Challenges
How do I simultaneously contain costs, improve security and service levels, and address regulatory compliance?
What is my first step in building an ITIL change management process? How will I know that it’s working?
– In what order should I tackle the ITIL process areas?
How do I attest to auditors that I have effective change management processes?
– Sarbanes-Oxley Section 404
– HIPAA, GLBA, CFR11a, etc.
– How do COBIT and ITIL fit together?
How do I create a good working relationship with my auditors?
– What do auditors doing controls-based audits look for?
– What happens if they cannot find effective controls?
Agenda
Examine the high-performing IT operations and security organizations
– What they all have in common
– What we can learn from them
Define the ideal working relationship between IT and audit
– Why auditors talk the way they do
– What auditors need to see
Building auditable and effective change management processes in four steps:
– Stabilize Patient
– Catch & Release and Find Fragile Artifacts
– Establish Repeatable Build Library
– Enable Continuous Improvement
The Highest Performing IT Organizations…
High performance Ops and Security organizations have:
• Highest ratio of staff deployed on pre-production processes
• Lowest amount of unplanned work
• Highest change success rate
• Best posture of compliance and security
Process areas shown in the diagram:
Service Design & Management
– Security Management
– Availability & Contingency Management
– Service Level Management
– Service Reporting
– Capacity Management
– Financial Management
Control Processes
– Asset & Configuration Management
– Change Management
Release Processes
– Release Management
Resolution Processes
– Incident Management
– Problem Management
Supplier Processes
– Customer Relationship Management
– Supplier Management
Automation
Common Process Areas Of High Performers
All the high-performers had self-derived the same way of working
– Culture of change management
– Culture of causality
– Culture of compliance and desire to continually reduce variance
Common Traits Of The Highest Performers
Culture of change management
– Integration of IT operations and security processes via problem management and change management processes
– Processes that serve both organizational needs, as well as business objectives
– Highest rate of effective change (approved changes, change success rate)
Culture of causality
– Highest service levels (MTTR, MTBF)
– Highest first fix rate (unneeded rework)
Culture of compliance and continual reduction of operational variance
– Production configurations
– Highest level of pre-production staffing
– Effective pre-production controls
– Effective pairing of preventive and detective controls
Causal Factors of IT Downtime
• Operator Error: 60%
• System Outages: 20%
• Application Failure: 20%
The chart also labels a Security Related portion (5%) and a Non-Security Related portion (15%).
Source: IDC, 2004
Capability Levels
Based on the IT Process Institute’s “Visible Ops” Framework
4 - Continuously Improving
• <5% of time spent on unplanned work
• Change success rate is very high
• Service levels are world class
• IT operating costs are under control
• Can scale IT capacity rapidly with marginal increases in IT costs
• Change review and learning processes are in place
• Able to increase capacity in a cost-effective way
3 - Closed-Loop Process
• 15-35% of time spent on unplanned work
• Some ticketing / workflow system in place
• Changes documented and approved
• Change success rate is high
• Service levels are pretty good
• Server-to-admin ratio is good, but not BoB
• IT costs are improving but still too high
• Security incidents down
2 - Using Honor System
• 35-50% of time spent on unplanned work
• Some technology deployed
• You have the right vision but no accountability
• Server-to-admin ratio is way too low
• IT costs are too high
• Process subverted by talking to the “right” people
1 - Reactive
• Over 50% of time spent on unplanned work
• Chaotic environment; lots of fire fighting
• MTTR is very long; poor service levels
• Can only scale by throwing people at the problem
[Figure: the four levels (Reactive, Using The Honor System, Closed-Loop Change Mgt, Continuously Improving) plotted against Effectiveness. Annotations: “Changes control the organization” and “Organization controls the changes”.]
Why Auditors Do The Things They Do
Given enough time and resources, auditors would love to count all the beans
– Go into the warehouses, open up all the containers, and inspect all the contents
– Rarely does this actually happen, for obvious reasons
Instead, auditors go to the “bean counting machine” to see whether the results are trustworthy
– What controls ensure that it hasn’t been subverted?
– What controls ensure that the results are correct?
For a variety of reasons, auditors are shifting from substantive audits to control audits
IT Controls 101
Preventative Controls
– Separation of duties
– Change management and authorization processes
Detective Controls
– Production controls around change management and configuration management
Corrective Controls
– Restoration and backup systems
Ideal Attestation of Controls
High performing shops typically have the highest service levels and the lowest cost of controls
– Best service levels (MTTR, MTBF), lowest amount of unplanned, unscheduled work, highest server/sysadmin ratio
– Best working relationship with audit
– Least amount of time dedicated to compliance activities
Why?
– They can point to their change management and governance process (preventative controls)
– They can show that the processes are working (detective controls)
How?
– Change management meeting minutes
– Three-ring binder of change orders and verified changes
COBIT and Change Management
COBIT AI6: Managing Changes
6.5 Documentation and Procedures
– Control objective: The change process should ensure that whenever system changes are implemented, the associated documentation and procedures are updated accordingly.
– Tripwire’s role: Tripwire is used to validate that all changes are tracked, synchronized with documentation (run books, etc.), and applied consistently across the appropriate systems.
6.6 Authorised Maintenance
– Control objective: IT management should ensure maintenance personnel have specific assignments and that their work is properly monitored. In addition, their system access rights should be controlled to avoid risks of unauthorised access to automated systems.
– Tripwire’s role: By reporting what changed on each system, when it occurred, and who made the change, Tripwire is used to ensure that all changes made are authorized, and made by authorized personnel. “Out of scope” changes, inconsistently applied changes, changes that occur outside the maintenance window, and other inappropriate changes are therefore discovered before they impact system availability.
6.7 Software Release Policy
– Control objective: IT management should ensure that the release of software is governed by formal procedures ensuring sign-off, packaging, regression testing, handover, etc.
– Tripwire’s role: Tripwire enables IT management to validate that formal sign-off processes are adhered to. Tripwire is also commonly used to ensure that packages are not altered during handoffs (through pre- and post-handoff comparison of released packages).
6.8 Distribution of Software
– Control objective: Specific internal control measures should be established to ensure distribution of the correct software element to the right place, with integrity, and in a timely manner with adequate audit trails.
– Tripwire’s role: Tripwire can compare changes that occur on production systems back to a reference baseline to ensure that software distribution happens consistently across target systems, within the prescribed time. All changes are recorded for historical and audit-related reporting and analysis.
COBIT DS9: Managing The Configuration
9.2 Configuration Baseline
– Control objective: IT management should be ensured that a baseline of configuration items is kept as a checkpoint to return to after changes.
– Tripwire’s role: Configuration baselines are a core competency of Tripwire software. Tripwire maintains a history of current and previously authorized baselines to determine whether the current (as-is) device and system configuration matches the authorized (as-specified) state, according to the configurations you’ve authorized for use in your environment. Tripwire can also enable rollback to an authorized state either by performing the rollback directly or providing a manifest to drive third-party restoration / provisioning systems.
9.4 Configuration Control
– Control objective: Procedures should ensure that the existence and consistency of recording of the IT configuration is periodically checked.
– Tripwire’s role: Tripwire integrity checks provide the means to assess the existence, consistency, and conformance of device and system configurations. These checks can be performed on an automatic, ongoing basis, as well as initiated on-demand by administrators.
9.5 Unauthorised Software
– Control objective: Clear policies restricting the use of personal and unlicensed software should be developed and enforced. The organisation should use virus detection and remedy software. Business and IT management should periodically check the organisation’s personal computers for unauthorized software. Compliance with the requirements of software and hardware license agreements should be reviewed on a periodic basis.
– Tripwire’s role: Tripwire is frequently used by customers to identify unauthorized or “rogue” applications within the production environment. This aids in enforcing configuration standards, as well as assisting in identification of, isolation of, and recovery from “day zero” attacks by viruses or worms.
The Tragic Truth About Auditors
Auditors gravitate to where controls appear weakest
To attract the attention of auditors, have unexplained outages and lots of unexplained changes
– “The top leading indicators of risk when we look at an IT operation are: poor service levels and unusual velocity of changes.” Bill Philhower
Visible Ops: Four Steps To Build An Effective Change Management Process
Each of the four Visible Ops steps is:
– A finite project: not an ISO 9001 initiative or a vague 5-year vision
– Catalytic: returns more resources to the organization than it consumes, fueling the next steps
– Sustaining: process stays in place, even when the initial force behind it disappears
– Auditable: supports factual reporting and attestation to process adherence and consistency
– Ordered: must be done in the specified order to achieve the above
Model based on five years studying high-performing IT Ops and Security organizations
Visible Ops has been donated to the ITPI
Visible Ops: Four Steps To Build An Effective Change Management Process
Phase 1: Electrify Fence, Modify First Response
Phase 2: Catch and Release, Find Fragile Artifacts
Phase 3: Establish Repeatable Build Library
Phase 4: Continually improve
Tripwire enforces the change process.
Tripwire rules out change as early as possible in the repair cycle.
Tripwire protects fragile artifacts.
Tripwire enforces change freeze and prevents configuration drift.
Tripwire captures known good state in preproduction.
Tripwire captures production changes that need to be baked into the build.
Tripwire detects change, which all process areas hinge upon.
Phase 1: Stabilize Patient, Modify First Response
Tripwire and IP Services
Issues
We have a tendency to “light and fight” our own fires
• 80% of outages are self-inflicted
• 80% of MTTR is dominated by asking “what changed?”
With sufficiently low change success rate, high rate of change, and high MTTR, we are spending all our time doing unplanned, unscheduled work
• Best in class: 5% of OpEx is spent on unplanned work
• Average: estimated around 25-45%
Changes are made without authorization, proactive scheduling, or full documentation
"The most likely way the world will be destroyed, most experts agree, is by accident. That's where we come in; we're computer professionals. We cause accidents."
Nathaniel Borenstein
Stabilize Patient
Curb the major cause of outages: 80% of outages are self-inflicted
Identify critical patients, clear everyone away from them unless they are authorized to operate
Document this new change policy: no changes unless authorized (preventative)
At this point, anyone even holding a scalpel should be viewed with suspicion
Electrify The Fence
We have now prescribed our first preventative change process and policy
– Why do most change management initiatives fail?
– What is the top audit finding around change controls?
Now we must “manage by fact” instead of “manage by belief” by electrifying the fences
– No one is allowed to be inside the change fence except on the weekends
– Why did Joe Bob touch the fence on Monday at 2:11am?
Document what should happen to Joe Bob:
– Public shaming, take a day off, or more…
“What is often overlooked is that if one person can single-handedly save the ship, that one person can probably single-handedly sink the ship, too.”
-- Unknown
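As a rough illustration of “managing by fact”, the fence can be expressed as a check on when each detected change happened. This is a minimal sketch, assuming the change window is all day Saturday and Sunday (the slide only says “except on the weekends”) and assuming detected changes arrive as (who, when) pairs; the function names and record layout are hypothetical.

```python
from datetime import datetime

# Assumption: the change window is all day Saturday and Sunday.
WEEKEND_DAYS = {5, 6}  # datetime.weekday(): Monday is 0, Sunday is 6


def inside_change_window(when: datetime) -> bool:
    """Return True if a change at this time falls inside the weekend window."""
    return when.weekday() in WEEKEND_DAYS


def fence_violations(detected_changes):
    """Return the detected changes that were made outside the change window.

    Each change is a (who, when) tuple; this layout is hypothetical.
    """
    return [(who, when) for who, when in detected_changes
            if not inside_change_window(when)]


detected = [
    ("joe.bob", datetime(2004, 10, 25, 2, 11)),  # a Monday at 2:11am
    ("alice", datetime(2004, 10, 23, 22, 0)),    # a Saturday evening
]
violations = fence_violations(detected)
for who, when in violations:
    print(f"{who} touched the fence at {when:%A %H:%M}")
```

The point of the sketch is only that the policy becomes something a detective control can evaluate; what happens to the people on the violations list is a management decision, not a code decision.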
Create Change Team
Get all necessary stakeholders who can best make decisions about changes, encompassing business goals, operational risks, technical risks, etc.
Key stakeholders for us: Security Lead, Ops Systems Engineering Lead, VP of Operations, Service Desk Manager, Director of Network Operations, and Internal Audit
Create weekly change management meetings that are mandatory for all change advisory board (CAB) members.
Hold Weekly Change Management Meetings
Create a path from desired change, to requested change, authorized change, scheduled change, implemented change, verified change.
Review implemented changes and ensure that all actual changes map to authorized work
Enable highest change throughput for the organization, best serve business needs, with the least amount of bureaucracy possible
– Weekly 15 min change management meetings are possible, with practice
– Keep good records of requested changes, authorized changes, and scheduled changes
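The path from desired change to verified change can be sketched as a small state machine that refuses skipped steps. This is an illustrative sketch only: the stage names come from the slide, while the class and method names are hypothetical and not part of any ticketing product.

```python
# Ordered stages taken from the slide; everything else here is hypothetical.
STAGES = ["desired", "requested", "authorized",
          "scheduled", "implemented", "verified"]


class ChangeRecord:
    """Tracks one change as it moves along the path, one stage at a time."""

    def __init__(self, summary: str):
        self.summary = summary
        self.stage = STAGES[0]
        self.history = [STAGES[0]]

    def advance(self, new_stage: str) -> None:
        """Move to the next stage; reject skipped or backward steps."""
        next_index = STAGES.index(self.stage) + 1
        if next_index >= len(STAGES):
            raise ValueError("change is already verified")
        if new_stage != STAGES[next_index]:
            raise ValueError(
                f"expected {STAGES[next_index]!r}, got {new_stage!r}")
        self.stage = new_stage
        self.history.append(new_stage)


change = ChangeRecord("reboot core switch")
change.advance("requested")
try:
    change.advance("scheduled")  # skips authorization
except ValueError as err:
    print("rejected:", err)
```

The history list doubles as the record of requested, authorized, and scheduled changes that the meeting is asked to keep.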
Change Management Guidelines
Don’t:
– Don’t authorize changes that do not have rollback plans that everybody reviews
– Don’t allow “rubber stamping” approval of changes
– Don’t let any system change off the hook: someone made it, so understand what caused it
Do:
– Do post-implementation reviews to determine whether the change succeeded or not
– Do track the change success rate
– Do use the change success rate to avoid making historically risky changes
“It’s not the strongest species that survive, nor the most intelligent… but the one most responsive to change.”
– Charles Darwin
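Tracking the change success rate, and using it to spot historically risky changes, might look like the following. This is a sketch under stated assumptions: post-implementation reviews are recorded as (category, succeeded) pairs, and the 80% threshold and the category names are made up for illustration.

```python
from collections import defaultdict


def success_rate_by_category(reviews):
    """Compute the success rate per change category.

    reviews: iterable of (category, succeeded) pairs taken from
    post-implementation reviews; this layout is hypothetical.
    """
    totals = defaultdict(int)
    successes = defaultdict(int)
    for category, succeeded in reviews:
        totals[category] += 1
        if succeeded:
            successes[category] += 1
    return {cat: successes[cat] / totals[cat] for cat in totals}


def risky_categories(rates, threshold=0.8):
    """Return categories whose success rate is below the (assumed) threshold."""
    return sorted(cat for cat, rate in rates.items() if rate < threshold)


reviews = [
    ("firewall rule", True), ("firewall rule", True),
    ("firewall rule", True), ("firewall rule", False),
    ("os patch", True), ("os patch", True),
]
rates = success_rate_by_category(reviews)
risky = risky_categories(rates)
print(rates, risky)
```

A change team could use the risky list to ask for extra testing or a stronger rollback plan before authorizing the next change in that category.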
Spectrum: Managing Change
Don’t expect to be doing “closed loop” change management right out of the chute – awareness is better than being oblivious, managed is better than unmanaged!
Spectrum
– Oblivious to change: "Hey, did the switch just reboot?"
– Aware of change: "Hey, who just rebooted the switch?"
– Announcing change: "Hey, I'm rebooting the switch. Let me know if that will cause a problem."
– Authorizing change: "Hey, I need to reboot the switch. Who needs to authorize this?"
– Scheduling change: "When is the next maintenance window? I'd like to reboot the switch then."
– Verifying change: "Looking at the fault manager logs, I can see that the switch rebooted as scheduled."
This is what SO-404 requires! (Preventative and detective controls)
Create Trusted Authorized Work Queue and Change Calendar
Create a work ticketing system that contains all the authorized work that went through the change management process
Create a change calendar (Forward Schedule Of Change) that the change manager uses to coordinate resources, manage risks, etc.
Modify First Response (1/2)
The key to a catalytic change management process is that it must return value back to the organization
Decrease MTTR, 80% of which is spent asking “what changed?”, by integrating the change management process into problem management
Whenever problem managers are mobilized, have all authorized changes and actual changes in the work ticket
The Microsoft MOF study showed that their best in class customers rebooted their servers 20x less often, and also had 5x fewer “blue screens of death.”
Modify First Response (2/2)
Eliminate change as early as possible by identifying the assets directly involved in the ticket and auditing them against their configuration baseline for the last 72 hours. All changes found are attached to the ticket.
If no changes are found the circle is widened to include changes made to infrastructure supporting the target systems.
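The two-stage lookup described above (check the assets named in the ticket first, then widen the circle) can be sketched as follows. The 72-hour lookback is from the slide; the change-log layout, key names, and asset names are assumptions made for illustration, not the format of any real change-auditing tool.

```python
from datetime import datetime, timedelta

LOOKBACK = timedelta(hours=72)  # lookback window stated on the slide


def recent_changes(change_log, assets, now):
    """Return changes to the given assets detected within the lookback window.

    change_log: list of dicts with hypothetical keys "asset", "when", "what".
    """
    cutoff = now - LOOKBACK
    return [c for c in change_log
            if c["asset"] in assets and c["when"] >= cutoff]


def changes_for_ticket(change_log, ticket_assets, supporting_assets, now):
    """Check the ticket's own assets first; widen to supporting
    infrastructure only if nothing changed on them."""
    found = recent_changes(change_log, ticket_assets, now)
    if found:
        return "direct", found
    return "supporting", recent_changes(change_log, supporting_assets, now)


now = datetime(2004, 10, 27, 9, 0)
log = [
    {"asset": "web-01", "when": datetime(2004, 10, 20, 14, 0),
     "what": "package upgraded"},          # older than 72 hours
    {"asset": "core-switch-1", "when": datetime(2004, 10, 26, 23, 30),
     "what": "running config changed"},    # within 72 hours
]
scope, found = changes_for_ticket(
    log, ticket_assets={"web-01"},
    supporting_assets={"core-switch-1", "db-01"}, now=now)
print(scope, [c["what"] for c in found])
```

Whatever the lookup returns would be attached to the trouble ticket, so the problem manager starts with “what changed?” already answered.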
“Grant me the Serenity to accept the things I can not change, Courage to change the things I can, and Wisdom to know the difference.”
– Dr. Reinhold Niebuhr (excerpt from the Serenity Prayer)
Phase 1: What You Have Built
Documented correct path from desired change to authorized change, scheduled change, implemented change, and verified change
Created documentation that the process is working
Returning value back to IT Ops by reducing MTTR, increasing change success rate and effective change throughput
What To Show The SO-404 Teams
Change governance and management processes
Meeting minutes of the change management meetings
Authorization processes
“Three ring binder” of stapled items:
– Authorized work order
– Change report on infrastructure showing correct changes made
– Signature of change manager verifying correct implementation of change
What To Show The Auditors
List of all outages and unscheduled downtime
Change management metrics
– Change rate (per week)
– Change success rate
– MTTR, MTBF
This would make most auditors breathe a sigh of relief
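MTTR and MTBF can be derived directly from the outage list. The sketch below assumes one common convention: MTTR is total downtime divided by the number of outages, and MTBF is total uptime in the reporting period divided by the number of outages. The sample period and outage times are invented for illustration.

```python
from datetime import datetime, timedelta


def mttr_and_mtbf(outages, period_start, period_end):
    """Return (MTTR, MTBF) as timedeltas for one reporting period.

    outages: list of (start, end) datetimes that fall inside the period.
    """
    if not outages:
        raise ValueError("no outages recorded in this period")
    downtime = sum((end - start for start, end in outages), timedelta())
    uptime = (period_end - period_start) - downtime
    return downtime / len(outages), uptime / len(outages)


# Hypothetical four-week reporting period with two outages (2 h and 4 h).
outages = [
    (datetime(2004, 10, 5, 1, 0), datetime(2004, 10, 5, 3, 0)),
    (datetime(2004, 10, 19, 14, 0), datetime(2004, 10, 19, 18, 0)),
]
mttr, mtbf = mttr_and_mtbf(
    outages, datetime(2004, 10, 1), datetime(2004, 10, 29))
print("MTTR:", mttr, "MTBF:", mtbf)
```

Other definitions of MTBF exist (for example, measuring from the start of one failure to the start of the next); whichever is chosen should be stated alongside the metric.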
Which Metric Do You Want To Improve?
Release
– Time to provision known good build
– # turns to a known good build
– Shelf life of build
– % of systems that match known good build
– % of builds that have security sign-off
– # of fast-tracked builds
– Ratio of release engineers to sysadmins
Controls
– # of changes authorized per week
– # of actual changes made per week
– Change success rate
– # of emergency changes
– # of service-affecting outages
– # of “special” changes
– # of “business as usual” changes
– Change management overhead
– Configuration variance
Resolution
– MTTR, MTBF
– % of time spent on unplanned work
Phase 4
Why Is Unplanned Work Such A Good Indicator?
(# of production changes) × (failed change % or unauth changes) × (mean time to repair) = % of time spent on unplanned work
High performer:
– # of production changes: > 1000 chg/wk
– Failed change % or unauth changes: < 1%
– Mean time to repair: minutes
– % of time spent on unplanned work: < 5% of OpEx
Average:
– # of production changes: unknown, hundreds
– Failed change % or unauth changes: ~30-50% (avg)
– Mean time to repair: hours, days
– % of time spent on unplanned work: 35-45% of OpEx
Average: 35-45% of OpEx spent on unplanned work!
Impact: late projects, rework, compliance issues, uncontrolled variance, etc…
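To see why the product of change volume, failure rate, and repair time drives unplanned work, it helps to plug in numbers. The sketch below is a worked example, not a measurement: every input value is hypothetical, chosen only to fall within the ranges quoted on the slide, and the conversion to a percentage assumes a team with 800 staff hours per week.

```python
def unplanned_hours_per_week(changes_per_week, failed_fraction, mttr_hours):
    """Hours of repair work generated per week by failed or unauthorized changes."""
    return changes_per_week * failed_fraction * mttr_hours


def unplanned_share(changes_per_week, failed_fraction, mttr_hours,
                    staff_hours_per_week):
    """Fraction of available staff time consumed by that repair work."""
    hours = unplanned_hours_per_week(
        changes_per_week, failed_fraction, mttr_hours)
    return hours / staff_hours_per_week


STAFF_HOURS = 800  # assumption: 20 people x 40 hours

# Hypothetical high performer: 1000 changes/week, 1% fail, 30-minute repairs.
high = unplanned_share(1000, 0.01, 0.5, STAFF_HOURS)

# Hypothetical average shop: 200 changes/week, 40% fail, 4-hour repairs.
average = unplanned_share(200, 0.40, 4.0, STAFF_HOURS)

print(f"high performer: {high:.1%}, average: {average:.1%}")
```

With these assumed inputs the high performer makes five times as many changes yet spends well under 5% of its time on unplanned work, while the average shop lands in the 35-45% range, which is the contrast the slide draws.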
What Affects These Variables?
(# of production changes) × (failed change % or unauth changes) × (mean time to repair) = % of time spent on unplanned work
Behaviors that increase change success rate:
• Effective change testing
• Effective risk review when approving changes
• Effective identification of change stakeholders
• Effective change scheduling
Behaviors that reduce unauthorized changes:
• Culture of change management
• Management ownership of change process
• Effective monitoring of infrastructure with detective controls to enforce change process
• Management use of corrective action when change processes are not followed
Behaviors that decrease MTTR:
• Culture of causality: desire to rule out change first in problem repair cycle
• Effective change management process that can report on authorized and scheduled changes
• Ability to distinguish planned and unplanned outage events
• Effective communications around scheduled changes
• Effective monitoring of infrastructure for production changes
What Do These Transformations Look Like?
Examples
– Joe Judge at Adero
– Ken Larson at Schlumberger-SEMA
– Kevin Behr at IP Services
Financial returns of process transformations
– Increased availability and decreased MTTR
– Reduction of unplanned work from 50% to 5% of OpEx
– Increased delivered capacity by 2x with 10% increase in OpEx
– Increased delivery of planned projects that deliver higher value to the business
– Fulfilled compliance and reduced cost of compliance
Why Do Auditors Love Continuous Improvement?
Controls are owned by the business to meet business objectives, instead of existing only to make auditors happy!
Auditors hate dragging organizations to implement controls, especially if it creates grudging and literal interpretations of findings
Continuous improvement requires process and controls, to detect and reduce variance
ITIL and COBIT
ITIL defines the set of all IT operational processes
COBIT defines all the controls that can be wrapped around them
ITIL and COBIT are complementary and orthogonal:
– Six Sigma defines how to build processes and their corresponding controls to continually monitor and reduce variance
– ITIL defines the change management processes
– COBIT defines the controls to ensure that the ITIL processes are auditable and effective
Caught in the Crossfire of Change
Rate of change is increasing with no signs of slowing
SarbOx, GLBA, CISP, etc.
Distributed systems
Heterogeneous environments
Service levels
Risk mitigation
Business objectives
Quality improvement
Staffing & Budgets
Security
Process Controls: Preventive, Detective, Corrective
Getting Control of Change
Control frameworks prescribe internal controls to enhance operational performance, security, and regulatory compliance
– COBIT, ITIL, ISO17799, SAS70
Change Management
Process Controls: Preventive, Detective, Corrective
• Change ticketing
• Help desk ticketing
• Backup/Restore
• Provisioning
• Manual inspection
• Scripts
• Firewalls, AV, IDS
• Vulnerability analysis
• Identity management
• Manual inspection
• Scripts
• Patch management
• Provisioning
Detective
• Automated change monitoring
• Integration with other tools & controls
• Change documentation & reporting
Infrastructure Systems (Servers, Network Devices, etc.)
Tripwire Change Auditing Solutions
1. Actual changes are detected on production systems and reconciled with approved and intended changes
– Diagram labels: Enterprise Management Systems (Change Management, Incident Management, Release Management); Actual, Approved, Intended, Unexpected
2. Change auditing results then flow back to change tickets, trouble tickets, audit and mgmt reports, plus configuration mgmt databases (CMDB)
– Diagram labels: CMDB, Verified, Reconciled, Audit Reports, Mgmt Reports
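Reconciling actual changes against approved changes is, at its core, a set comparison. The sketch below is a generic illustration and not Tripwire's implementation or API: changes are modeled as hypothetical (system, item) pairs, and the “not_implemented” bucket is an added assumption for approved changes that never showed up on production.

```python
def reconcile(actual, approved):
    """Split changes into verified, unexpected, and not-yet-implemented.

    actual:   set of (system, item) pairs detected on production.
    approved: set of (system, item) pairs from authorized change tickets.
    Both layouts are hypothetical.
    """
    return {
        "verified": actual & approved,          # detected and approved
        "unexpected": actual - approved,        # detected, never approved
        "not_implemented": approved - actual,   # approved, never detected
    }


actual = {
    ("web-01", "/etc/httpd/conf/httpd.conf"),
    ("web-01", "/etc/passwd"),
}
approved = {
    ("web-01", "/etc/httpd/conf/httpd.conf"),
    ("db-01", "/etc/my.cnf"),
}
result = reconcile(actual, approved)
for bucket, changes in result.items():
    print(bucket, sorted(changes))
```

The verified bucket is what flows back to close change tickets; the unexpected bucket is what an auditor, or an incident responder, will want explained.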
Can You Answer These Questions?
Pick any piece of your infrastructure (router, server, firewall, etc.)
– If a change is made to this device, how will you know?
– How soon will you know?
– How will you know if the change is good or bad?
• How long will that process take?
• What happens when the change is good?
• What happens when the change is bad?
– How do you verify that each change has been reconciled?
– How do you report on all of the above?
– Can you provide a historical report accounting for all changes in your environment?
This is what auditors want to know about how changes are managed in your IT infrastructure
With Tripwire, you can answer all of these questions
Improving Service Quality And Availability
Problem: Change management in place, but lacked enforcement
– Saw changes occurring, but didn’t have the means to validate
Customer: IT Services operations of a Major Energy Services company
Tripwire solution: Tripwire detects change and puts “teeth” in the process
– Tracking What, When, Who, How and Why a change was made
– Tripwire provides “black and white documentation” to enforce process
Increased staff efficiency, uptime, and service quality
– “We used to spend 45% to 50% of our time on unplanned work. Now it’s around 5%.”
– “In spite of force reductions, customers describe our services as ‘phenomenally better’ now.”
Get Involved!
Join ICOPL (ITPI Community Of Practice List-Serv)
– http://www.itpi.org/home/icopl.php
There is now a Visible Ops Pocket Guide!
– http://www.itpi.org/home/visibleops.php
We are looking for volunteers to help with our research projects.
IMCA is now online at the ITPI
– http://www.itpi.org/home/imca.php
If you have a high performing organization, we want to study you!
Summary
Control is possible. We merely need to look at the high-performing IT organizations to confirm this.
Transformation is possible. Visible Ops is the result of years of studying high-performing IT operations and security organizations in conjunction with the ITPI.
Visible Ops illustrates how interested organizations might replicate the processes of these high-performing organizations in just four achievable steps.
Gene Kim: [email protected]