Case Study: University of Chicago Achieves High Availability through a Centralized and Service Centric Approach to IT Monitoring

Erik Giles

Command Center Manager

The University of Chicago

@ErikGiles (https://twitter.com/ErikGiles)

Abe Shaker

IT Monitoring Engineer

The University of Chicago

DevOps: Agile Ops

DO5T17S

© 2015 CA. ALL RIGHTS RESERVED. @CAWORLD #CAWORLD

For Informational Purposes Only: Terms of this Presentation

© 2015 CA. All rights reserved. All trademarks referenced herein belong to their respective companies.

The content provided in this CA World 2015 presentation is intended for informational purposes only and does not form any type of warranty. The information provided by a CA partner and/or CA customer has not been reviewed for accuracy by CA.


Abstract

Learn how the IT operations team at the University of Chicago built a centralized and service centric approach to IT infrastructure monitoring. The University of Chicago is using CA Unified Infrastructure Management (CA UIM) to implement a central, integrated approach for monitoring IT systems, applications, networks, VoIP phones, data center infrastructure and business services. Be it PBX phones or data center water chillers, they are monitoring it all centrally. Breaking down organizational silos is critical to success in this approach, so learn insights and tips on how to overcome this barrier. In addition, we will talk about the experience of moving to CA UIM from CA eHealth.

Erik Giles

Abe Shaker

University Of Chicago


Agenda

1. BACKGROUND

2. OUR APPROACH & STRATEGY

3. THREE PHASES OF IMPLEMENTATION

4. COMPARING CA UIM AND CA EHEALTH

5. QUESTIONS

Background: University of Chicago

– Founded by Rockefeller in 1890

– ~6K Undergrads, ~10K Graduate Students, ~8K Staff and Researchers

– 89 Nobel Prize winners

– 1st Heisman Winner in 1935 (Jay Berwanger) [anyone know the original name?]

– Campus extensions in New Delhi, Paris, London, and Hong Kong

Background: Erik Giles & Abe Shaker

• Erik Giles

– Univ. of Chicago Command Center Manager (SM Best Practice Consultant)

– Running Command Centers (and their tools) since 2004

– In technology since 1996 with MS in Engineering & MBA from USC

– Worked in Technical Leadership at Boeing, Orbits, Publicis, IL Tollway

• Abe Shaker

– Lead Reporting & Monitoring Engineer for Univ. of Chicago

– Working in monitoring tool management since 2010

– In technology since 1998 with degree in Electronics

– Worked at State Farm, Motorola, Dept. of Veterans Affairs, DeVry

CA UIM @ University Of Chicago

Ensures IT services availability and reliability

• Our environment at a glance

– Running CA UIM since June 2011

– Currently monitoring over 5,000 devices across the globe (35K alarms)

– Integrating 6 “other” monitoring tools into a common window

– Watched 24/7/365 by the Command Center Team

– Working with 13 Infrastructure Teams

CA UIM & CA Spectrum Architecture Diagram

Since Q1FY14, Spectrum has been upgraded twice from 8.6 to 9.4.2.1

CA UIM servers were installed on RHEL 6.6 (UIM 8.0) and were upgraded to 8.2.

Broad vs Deep

• We have been following a strategy that brings all devices into monitoring and then slowly increases the fidelity of that monitoring in phases.

• This is a trade-off. Do you…

1. Get early wins by working with one supportive team (such as Network or Windows) to show how much can be done with a strong Enterprise Monitoring tool.

2. Build a foundation for a complete integrated monitoring solution by building in all devices at the same time and operationalizing it as everyone comes on board.

We Chose #2 - Here Is Why

• We can see everything in our environment.

• We never have to rework or go back on something to make a new technology work.

• Our culture and technology can evolve at the same time.

• All the infrastructure teams have skin in the game and learn from each other.

Phased Approach

• To do “broad then deep” you have to have a clear set of phases.

1. “Ping Only” to start to get everything in the system. (five months for us)

2. Hardware based SNMP Model to get proactive. (16 months for us)

3. Service based solution to align to management and customers. (just started)

• Each phase has the following

– Unique Team across leadership and technical elements

– Very Specific Outcomes and Objectives (hard metrics)

– A plan with templated deliverables and management commitments

– Standing meetings: Steering Committee (monthly), Leadership (weekly), and Technical (weekly)

• Must complete each phase in order (no matter how long it takes)

• Tip: Tie project to Outage reduction and SM Metric Improvements

Seven Quarters of Outages*

[Chart: number of outages per quarter (y-axis 0 to 25), Q1FY14 through Q3FY15, broken down by service. Timeline markers: Phase I Development, Phase I Production, Phase II Development, Phase II Production.]

Services in the chart legend: xMail Email/Calendar; Wireless Data Networking; Windows; Voicemail and Unified Messaging; UPS; UChicago Time (aka Time and Attendance); Telephone Service; Storage; Specialty Voice Services; Service Desk; Remote Site Connectivity; Physical Network Connections; Phones & Internet Connections; Mainframe Job Scheduling; Mainframe; Gargoyle - Student Information System; Employee Self Service (ESS, Benefits Management); DNS Management; Desktop Support; Delphi Planning; Database Management; cVPN (VPN) Virtual Private Network; ComEd Power; Cisco (VoIP); Chalk Learning Management System; Call Center; AURA - Grants Reporting (Research/Business Objects)

Chart callouts:

• Averaging an outage a week, as we had for years up until the monitoring effort

• Phase I monitoring is catching many more events, so our numbers go up

• As we do the work for Phase II, outages are dropping considerably

This is the slide that sold leadership on the approach

3 Nines of Hardware Uptime (last two quarters)

[Two charts: hardware uptime plotted for points 1 to 13 on the x-axis, y-axis 99.600% to 100.000%.]

• Q3FY15 Hardware Uptime (to date): Average Uptime 99.954%

• Q2FY15 Hardware Uptime: Average Uptime 99.960%

This is the slide that sold the technical teams
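To put the quarterly averages above in perspective, an uptime percentage can be converted into minutes of downtime. The sketch below is only illustrative arithmetic: the function name `downtime_minutes` and the 91-day quarter are assumptions, not part of the presentation, and the Q3FY15 figure was reported "to date", so treating it as a full quarter overstates the period.

```python
def downtime_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime implied by an uptime percentage over a period."""
    if not 0.0 <= uptime_percent <= 100.0:
        raise ValueError("uptime_percent must be between 0 and 100")
    total_minutes = period_days * 24 * 60
    return total_minutes * (1.0 - uptime_percent / 100.0)


# Assumption: a quarter is treated as 13 weeks (91 days).
QUARTER_DAYS = 91

if __name__ == "__main__":
    for label, uptime in [("Q2FY15", 99.960), ("Q3FY15 (to date)", 99.954)]:
        minutes = downtime_minutes(uptime, QUARTER_DAYS)
        print(f"{label}: roughly {minutes:.0f} minutes of downtime per quarter")

    # Three nines over an average month (365.25 / 12 days) lands near
    # the "44 minutes a month" figure commonly quoted for 99.9%.
    print(f"99.9% monthly budget: {downtime_minutes(99.9, 365.25 / 12):.1f} minutes")
```

The same function works for any availability target, which makes it easy to show teams what a goal such as three nines means as a concrete downtime budget.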

All In! (Phase I – Connectivity)

• Our goal was simply to get every device into Spectrum via a simple Ping.

• Objectives and Outcomes

– Capture 95% of Outages in Monitoring

– Monitor all uChicago Hardware

– Operationalize Monitoring with 24/7 Staff

– Report on Hardware uptime and alarm counts

• Business Notes

– We exceeded our outage capture targets (98% outages captured on ping only)

– Getting all the hardware integrated was much harder than you’d think

– Letting an operational team monitor other teams’ hardware was a BIG cultural shift

– Spent a lot of time getting the operational side right (and consistent)

– Avoid politics by focusing on the data (always have data!)
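As a rough illustration of what a "ping only" phase checks, the sketch below sweeps a list of hosts with one ICMP echo each and reports which ones did not answer. It is not how Spectrum performs discovery; the function names and host names are hypothetical, and the `ping` flags assume the Linux iputils version (`-c` for count, `-W` for timeout in seconds).

```python
import subprocess
from typing import Callable, Dict, Iterable, List


def ping_command(host: str, timeout_s: int = 2) -> List[str]:
    """Build a one-packet ping command (Linux iputils flags assumed)."""
    return ["ping", "-c", "1", "-W", str(timeout_s), host]


def run_quietly(command: List[str]) -> int:
    """Run a command, discard its output, and return its exit code."""
    completed = subprocess.run(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return completed.returncode


def ping_sweep(
    hosts: Iterable[str],
    runner: Callable[[List[str]], int] = run_quietly,
) -> Dict[str, bool]:
    """Map each host to True if a single ping returned exit code 0."""
    return {host: runner(ping_command(host)) == 0 for host in hosts}


def unreachable(results: Dict[str, bool]) -> List[str]:
    """Sorted list of hosts that did not answer."""
    return sorted(host for host, ok in results.items() if not ok)


if __name__ == "__main__":
    # Hypothetical host names; a fake runner avoids real network traffic.
    fake_runner = lambda cmd: 0 if cmd[-1] == "core-sw-1" else 1
    results = ping_sweep(["core-sw-1", "pbx-2"], runner=fake_runner)
    print(unreachable(results))
```

Passing the runner as a parameter keeps the sweep testable without network access. Note that a failed ping does not always mean a device is down: as the technical notes in this deck point out, ICMP was blocked in many locations.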

Connectivity Technical Strategy

• Top Five Technical Notes

– Hardware Firewall Issues

• ICMP blocked in most locations across campus

– Software Firewall Issues

• iptables rules needed to be put in place to allow communication from our SpectroSERVER

– Network ACLs

• ACLs had to be updated on our entire distribution layer to account for the source traffic from Spectrum

– Non-routable IP space

• Lots of private IP space required the setup of VPNs to allow communication

– Re-used IPs and old DNS entries

• We noticed a lot of discrepancies between the list of systems to be monitored and how Spectrum was discovering them. Large blocks of re-used IPs w/ out the proper update to DNS
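The discrepancy check described in the last note can be sketched as a comparison between two hostname-to-IP mappings: the list of systems to be monitored and what discovery actually found. This is a minimal sketch under assumptions; the function name, the report keys, and the sample hosts and addresses (taken from the 192.0.2.0/24 documentation range) are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def find_discrepancies(
    inventory: Dict[str, str], discovered: Dict[str, str]
) -> Dict[str, List[str]]:
    """Compare an inventory (hostname -> IP) with discovered devices."""
    inventory_names = set(inventory)
    discovered_names = set(discovered)
    report = {
        "missing_from_discovery": sorted(inventory_names - discovered_names),
        "unexpected_in_discovery": sorted(discovered_names - inventory_names),
        "ip_mismatch": sorted(
            name
            for name in inventory_names & discovered_names
            if inventory[name] != discovered[name]
        ),
    }
    # Flag addresses claimed by more than one inventory hostname,
    # which usually points at a re-used IP with a stale DNS entry.
    names_by_ip = defaultdict(list)
    for name, ip in inventory.items():
        names_by_ip[ip].append(name)
    report["reused_ips"] = sorted(
        ip for ip, names in names_by_ip.items() if len(names) > 1
    )
    return report


if __name__ == "__main__":
    inventory = {"web-1": "192.0.2.10", "db-1": "192.0.2.11", "old-db": "192.0.2.11"}
    discovered = {"web-1": "192.0.2.10", "db-1": "192.0.2.12", "lab-pc": "192.0.2.99"}
    for kind, items in find_discrepancies(inventory, discovered).items():
        print(kind, items)
```

Running a report like this before each discovery pass gives the teams data to clean up, in line with the "always have data" advice from Phase I.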

Getting Proactive (Phase II – SNMP)

• The goal was to get as many devices on SNMP as we had licenses and integrate the rest so that we could be proactive at a hardware level.

• Objectives and Outcomes
– Reduce Outages by 50%

– Reduce Severity 1-3 Incidents by 50% overall

– Operationalize responses to non-Outage events (adding Major/Minor Alarms)

– Real-time dashboards and reports for all hardware teams

• Business Notes

– We far exceeded our outage goals with a 300% reduction in Outages (before we finished)

– Zero teams wanted to commit to this goal (management intervention required)

– Huge emphasis on cleaning up technical debt and following processes

– Even technical folks didn’t understand the difference between “events” and “utilization”

– Best work was done when technical folks worked with each other (without management)

SNMP Technical Strategy

• Top Eleven Technical Notes

1. The same challenges experienced w/ ICMP across all of Phase I had to be re-visited w/ SNMP

2. SNMP Device Certification of custom models

3. Lack of SNMP support

• Encountered some critical devices that either didn’t support SNMP or had it disabled

4. NATd IPs

• Had issues getting NATd models monitored

5. No use of agents

• Agents were not allowed on certain servers which limited our monitoring options

6. Lots and lots of 3rd party monitoring system integrations

• Instead of natively monitoring models we have a large number of third party integrations. This presented us w/ a new set of challenges

– Mapping alarm and event fields between applications

– Auto-clearing

– Dynamic Alarm fields

– Mapping of severities

– Two-way communication for note entries and the ACK check mark
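Three of the integration challenges listed above (field mapping, severity mapping, and auto-clearing) can be sketched in a few lines. This is not the Spectrum or CA UIM integration API; the field names, the severity vocabulary beyond the Major/Minor levels mentioned in this deck, and the `AlarmTable` class are hypothetical, and unknown severities default to "Minor" purely as an illustrative choice.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

# Hypothetical mapping from one third-party tool's field names to common ones.
FIELD_MAP = {"node": "device", "check": "source_check", "level": "severity", "msg": "summary"}

# Hypothetical mapping from the tool's severities to the central console's.
SEVERITY_MAP = {"critical": "Critical", "error": "Major", "warning": "Minor", "ok": "Clear"}


@dataclass(frozen=True)
class Alarm:
    device: str
    source_check: str
    severity: str
    summary: str


def normalize(raw: Dict[str, str]) -> Alarm:
    """Translate a third-party event into the common alarm shape."""
    fields = {common: raw.get(source, "") for source, common in FIELD_MAP.items()}
    # Assumption: severities we do not recognize are shown as Minor.
    fields["severity"] = SEVERITY_MAP.get(fields["severity"].lower(), "Minor")
    return Alarm(**fields)


class AlarmTable:
    """Active alarms keyed by (device, check); a Clear event removes the alarm."""

    def __init__(self) -> None:
        self.active: Dict[Tuple[str, str], Alarm] = {}

    def apply(self, alarm: Alarm) -> None:
        key = (alarm.device, alarm.source_check)
        if alarm.severity == "Clear":
            self.active.pop(key, None)  # auto-clear the matching alarm
        else:
            self.active[key] = alarm


if __name__ == "__main__":
    table = AlarmTable()
    table.apply(normalize({"node": "pbx-2", "check": "trunk", "level": "ERROR", "msg": "trunk down"}))
    print(len(table.active))
    table.apply(normalize({"node": "pbx-2", "check": "trunk", "level": "ok", "msg": "trunk restored"}))
    print(len(table.active))
```

Keying alarms on device plus check is what lets a later "ok" event find and clear the right alarm. Two-way note and acknowledgement syncing would need a write path back to the source tool, which this sketch leaves out.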

SNMP Technical Strategy (continued)

7. Technical challenges in dealing w/ the increase of monitoring from just ICMP to SNMP

• Server and Network load

– Massive increase in events

8. Architectural Diagram

• Ran into issues during upgrades as OneClick was installed on our SRM server (w/ out knowing)

9. LDAP Issues

• Had to abandon the direct Spectrum to LDAP solution as we were having a hung-session/time-out problem. Instead EEM was upgraded and re-integrated

10. Uniformity Across User Accounts

• We initially had different views for different techs in the command center, which led to some alarms getting missed. Locked the initial view down in their user preferences to resolve.

• Noticed there were far too many user groups, each w/ its own set of permissions. Simplified permissions to better match the logical grouping of the University

11. The local_policy.jar file

• We have a number of non-technical users who do not have admin rights to their computers. The Spectrum upgrade required the local_policy.jar file to be placed into the Java security folder, which required elevated permissions.


Service Centric (Phase III – Service View)

• Our goal here is to go from a hardware centric view to a service centric view by regrouping all the hardware by service and then monitoring and measuring by service.

• Objectives and Outcomes
– Three 9’s (99.9%) of uptime for all uChicago IT Services (44 minutes of downtime a month)

– Command Center actually monitoring “services” not just the hardware

– Vanity Dashboards by Service for Customers and Management

• Business Notes

– (can’t give you outcomes yet because we are just starting this phase)

• Our hardware is already at Four 9’s

– Application Teams are very supportive of doing this work and active participants

• BUT… any shortcuts you took with Hardware at SNMP will now bite you in the a__!

– Having a CMDB makes this easier but if not you’ll end up building a lot of the basics of it

– People are obsessed with Synthetic Transaction Monitoring (without understanding it)

– With a totally new team it feels a lot like starting over from a “buy-in” perspective
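Regrouping hardware by service, as this phase describes, amounts to mapping each service to the devices behind it and rolling device status up to a service status. The sketch below uses a simple worst-case rollup; the function name, device names, and severity ranking are hypothetical, and the worst-case rule ignores redundancy (a failed node in a redundant pair would still mark the service as degraded).

```python
from typing import Dict, List

# Hypothetical severity ranking, from healthy to worst.
SEVERITY_RANK = {"Clear": 0, "Minor": 1, "Major": 2, "Critical": 3}


def service_status(
    service_map: Dict[str, List[str]], device_status: Dict[str, str]
) -> Dict[str, str]:
    """Return the worst device severity for each service."""
    result = {}
    for service, devices in service_map.items():
        # Assumption: a device with no reported status is treated as
        # Critical so that monitoring gaps are visible, not hidden.
        severities = [device_status.get(device, "Critical") for device in devices]
        result[service] = max(
            severities, key=SEVERITY_RANK.__getitem__, default="Clear"
        )
    return result


if __name__ == "__main__":
    # Service names come from the outage chart; device names are made up.
    service_map = {
        "Chalk Learning Management System": ["chalk-web-1", "chalk-db-1"],
        "Cisco (VoIP)": ["pbx-1"],
    }
    device_status = {"chalk-web-1": "Clear", "chalk-db-1": "Major", "pbx-1": "Clear"}
    for service, status in service_status(service_map, device_status).items():
        print(f"{service}: {status}")
```

The service-to-device map is the part a CMDB would normally supply, which is why the notes above say you end up building the basics of one if you do not already have it.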

CA eHealth to CA UIM

• Business Advantages

– Provides far greater monitoring capability with improved scalability

– Significantly improved look and feel with lots of dashboard customization options

– Better quality and “current-ness” of technical support

– Clearly the futureproof option with the CA product

• Technical Challenges

– A lot of effort went into getting FW rules and ACLs updated to allow SNMP polling from the new UIM servers

– With eHealth, the EEM (single sign on) solution allowed Spectrum users to view eHealth content from OneClick w/ out having to re-authenticate. No equivalent single sign on solution exists yet for CA UIM.

• Note, CA UIM doesn’t support Unix/Linux based LDAP authentication. Windows AD or eDirectory only

– Migration of any monitoring setup in eHealth (i.e. network or disk utilization) to CA UIM

– CA UIM’s architecture is different than eHealth which could lead to a larger application footprint

• Suggested setup could have 4 different servers at the minimum

– Two UIM servers (primary/standby)

– External DB server (SQL or Oracle)

– UMP Server (Webserver)

– We have our SNMP collector housed on a separate server as well

– Work will have to be put in to duplicate any reports generated in eHealth as the default/out-of-the-box reports in CA UIM differ

– A learning curve will be present as you’ll have to get used to a new interface, setup procedure, application maintenance, etc…

Synthetic Transaction Monitoring

• Synthetic Transaction Monitoring is the “cool thing” in monitoring, and you cannot implement an Integrated Central solution on local enterprise systems without getting feedback that if we just did STM then none of this would matter and it’d all be easier. (usually from your cloud loving app folks)

• STM does have its place and can be a key element of your strategy

– It WILL find more things than traditional monitoring will.

– It WILL give you a better understanding of the customer experience.

– It IS easier to set up and get started right away.

• But STM is not really a complete solution and here is why

– It WON’T tell you what is actually wrong with your systems, including where the problem is.

– It WON’T actually end up being cheaper, because to come even close to the amount of data you can get from a traditional monitoring solution would be 5X as expensive.

– It ISN’T a complete solution unless all you are looking for is connectivity and latency.

• We DO plan to use STM down the line (Phase IV)

– It will improve our enterprise solution by identifying issues for which we didn’t configure monitoring (yet)

– It will give us a solution for cloud-based services that we cannot monitor traditionally

– It will show us the customer experience (connectivity and latency)
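A synthetic transaction, at its simplest, is a scripted request against a service that records whether it succeeded and how long it took. The sketch below is a generic illustration using only the Python standard library, not a CA UIM probe; the function names, the example URL, and the two-second latency threshold are assumptions.

```python
import time
import urllib.request
from dataclasses import dataclass
from typing import Callable


@dataclass
class CheckResult:
    ok: bool
    latency_s: float
    detail: str


def http_get_status(url: str, timeout_s: float) -> int:
    """Fetch a URL and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s) as response:
        return response.status


def synthetic_check(
    url: str,
    fetch: Callable[[str, float], int] = http_get_status,
    timeout_s: float = 5.0,
    max_latency_s: float = 2.0,
    clock: Callable[[], float] = time.perf_counter,
) -> CheckResult:
    """Run one scripted request and judge it on connectivity and latency."""
    start = clock()
    try:
        status = fetch(url, timeout_s)
    except Exception as exc:  # DNS failure, refused connection, timeout, HTTP error
        return CheckResult(False, clock() - start, f"request failed: {exc}")
    latency = clock() - start
    if status != 200:
        return CheckResult(False, latency, f"unexpected status {status}")
    if latency > max_latency_s:
        return CheckResult(False, latency, f"slow response: {latency:.2f}s")
    return CheckResult(True, latency, "ok")


if __name__ == "__main__":
    # A stand-in fetch function keeps the demo off the network.
    result = synthetic_check("https://service.example/login", fetch=lambda url, t: 200)
    print(result.ok, result.detail)
```

The result illustrates the limitation called out above: a failed check says the service is unreachable or slow from the user's point of view, but nothing about which server, link, or process is at fault.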

Achievements by uChicago

Three 9’s Availability

Reduced Outages by 300%

ENHANCED VISIBILITY ACROSS THE ORGANIZATION

BETTER SERVICE QUALITY

IMPROVED SCALABILITY

FUTURE PROOF


Q & A


For More Information

To learn more, please visit:

http://cainc.to/Nv2VOe

CA World ’15


http://bit.ly/1wbjjqX
