incident response management - metrics, data, visualize & apply

Incident Response Management

Aline Tran

Sr. Application Support Administrator

KS Bishop Estate

Insight into IRM with metrics, data, and visualization at a $10 billion non-profit organization specializing in commercial real estate, agriculture, land conservation, community outreach, and education.

Presentation Progression

Step 1: Metrics
Step 2: Data
Step 3: Visualization
Step 4: Application!

Step 1: What Are Your Metrics?

Incident Response Volume
• Monthly, yearly totals
• Frequency patterns: peak times

Time to Detect (TTD)
• Detection resource
• Origin to Detection

Time to Resolve (TTR)
• Detection to Resolution

Total Response Time (TRT)
• TTD + TTR = TRT
• Timeline - Identify pain points
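The three timing metrics can be derived from three timestamps per incident. The sketch below is a minimal illustration in Python, not the presenter's tooling: the function name `response_metrics` and the choice of whole days as the unit are assumptions. The sample dates come from the first row of the incident table later in the deck, treating its third date column as the resolution date.

```python
from datetime import date

def response_metrics(started: date, detected: date, resolved: date) -> dict:
    """Return TTD, TTR and TRT in whole days for one incident (hypothetical helper)."""
    ttd = (detected - started).days    # Time to Detect: origin to detection
    ttr = (resolved - detected).days   # Time to Resolve: detection to resolution
    return {"ttd": ttd, "ttr": ttr, "trt": ttd + ttr}  # TRT = TTD + TTR

# Started 1/3/2016, detected 1/5/2016, resolved 1/15/2016
m = response_metrics(date(2016, 1, 3), date(2016, 1, 5), date(2016, 1, 15))
print(m)
```

Keeping TTD and TTR separate, rather than storing only TRT, is what lets a timeline show whether the pain point is in detection or in resolution.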

Step 2: What Does Your Data Say?

Location | Incident Type | Monthly Incidents | Yearly Incidents
Seattle | Server Outage | 15 | 112
Seattle | Service Outage | 10 | 33
Seattle | Site Outage | 9 | 50
India | Server Outage | 24 | 34
India | Service Outage | 12 | 133
India | Site Outage | 32 | 65
London | Server Outage | 10 | 23
London | Service Outage | 23 | 43
London | Site Outage | 5 | 88
Arizona | Server Outage | 12 | 23
Arizona | Service Outage | 10 | 55
Arizona | Site Outage | 27 | 54

WHAT DO YOU SEE?

Sandlot movie 1993

Flip the lens…

Step 3: Visualization

• 3 types of incidents
• India = overall most outages
• India: most Service outages
• Seattle: most Server outages
• Arizona has the least outages
• Global outages = 713
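The findings above follow from the yearly column of the Step 2 table. The sketch below reproduces them in Python under the assumption that the table is transcribed by hand into tuples. The names `ROWS`, `yearly_by_location` and `top_location` are illustrative and do not come from the slides.

```python
from collections import defaultdict

# (location, incident type, monthly incidents, yearly incidents), copied from the Step 2 table
ROWS = [
    ("Seattle", "Server Outage", 15, 112),
    ("Seattle", "Service Outage", 10, 33),
    ("Seattle", "Site Outage", 9, 50),
    ("India", "Server Outage", 24, 34),
    ("India", "Service Outage", 12, 133),
    ("India", "Site Outage", 32, 65),
    ("London", "Server Outage", 10, 23),
    ("London", "Service Outage", 23, 43),
    ("London", "Site Outage", 5, 88),
    ("Arizona", "Server Outage", 12, 23),
    ("Arizona", "Service Outage", 10, 55),
    ("Arizona", "Site Outage", 27, 54),
]

def yearly_by_location(rows):
    """Sum the yearly incident counts per location."""
    totals = defaultdict(int)
    for location, _incident_type, _monthly, yearly in rows:
        totals[location] += yearly
    return dict(totals)

def top_location(rows, incident_type):
    """Return the location with the highest yearly count for one incident type."""
    matching = [r for r in rows if r[1] == incident_type]
    return max(matching, key=lambda r: r[3])[0]

by_location = yearly_by_location(ROWS)
most_outages = max(by_location, key=by_location.get)
least_outages = min(by_location, key=by_location.get)
global_total = sum(by_location.values())

print(by_location)
print(most_outages, least_outages, global_total)
print(top_location(ROWS, "Service Outage"), top_location(ROWS, "Server Outage"))
```

A bar chart of `by_location` or of the per-type totals would be the natural next step. Plotting is left out here to keep the example free of third-party dependencies.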

WHAT DO YOU SEE NOW?

Category | Incident Type | Location(s) | Customer Impact | Resource Impact | Detection Source | Started | Detected | Reported | Time to Resolve | Total Response Time | Reoccurring
High | Service Outage | Seattle | 57 | 32 | Alert | 1/3/2016 | 1/5/2016 | 1/15/2016 | 10 | 12 | 0
Medium | App | Arizona | 57 | 72 | Alert | 2/3/2016 | 2/5/2016 | 2/10/2016 | 5 | 7 | 1
Low | Server | India | 32 | 62 | User | 3/3/2016 | 3/5/2016 | 3/8/2016 | 3 | 5 | 0
Medium | System | Arizona | 47 | 32 | User | 4/3/2016 | 4/5/2016 | 4/9/2016 | 4 | 6 | 1
High | Site Outage | London | 54 | 79 | Alert | 5/3/2016 | 5/5/2016 | 5/11/2016 | 6 | 8 | 1
Medium | Server | Arizona | 35 | 25 | IT Ops | 6/3/2016 | 6/5/2016 | 6/8/2016 | 3 | 5 | 0
Medium | Security | Seattle | 60 | 50 | Alert | 7/3/2016 | 7/5/2016 | 7/7/2016 | 2 | 4 | 0
High | Site Outage | London | 50 | 80 | Help Desk | 8/3/2016 | 8/5/2016 | 8/5/2016 | 0 | 2 | 0
Low | Security | Arizona | 65 | 85 | Alert | 3/3/2016 | 3/5/2016 | 3/11/2016 | 6 | 8 | 0
Low | Service Outage | India | 69 | 54 | Alert | 2/3/2016 | 2/5/2016 | 2/8/2016 | 3 | 5 | 0
High | Server | Arizona | 37 | 44 | User | 3/3/2016 | 3/5/2016 | 3/9/2016 | 4 | 6 | 1
High | Service Outage | Seattle | 38 | 28 | User | 4/3/2016 | 4/5/2016 | 4/12/2016 | 7 | 9 | 1
High | Service Outage | London | 44 | 34 | Alert | 1/13/2016 | 1/15/2016 | 1/15/2016 | 0 | 2 | 0
Medium | App | India | 41 | 56 | IT Ops | 2/10/2016 | 2/12/2016 | 2/16/2016 | 4 | 6 | 1
Medium | System | London | 47 | 67 | Help Desk | 3/18/2016 | 3/20/2016 | 3/25/2016 | 5 | 7 | 1
Low | Security | Arizona | 49 | 34 | Help Desk | 4/13/2016 | 4/15/2016 | 4/20/2016 | 5 | 7 | 0
Medium | Site Outage | Seattle | 63 | 38 | Alert | 4/13/2016 | 4/15/2016 | 4/18/2016 | 3 | 5 | 0
Low | Server | London | 37 | 62 | Alert | 6/3/2016 | 6/5/2016 | 6/11/2016 | 6 | 8 | 0
High | Service Outage | India | 32 | 22 | User | 7/3/2016 | 7/5/2016 | 7/11/2016 | 6 | 8 | 0

Might need stronger prescription…

And Now?

• Difference between TRT and TTR

• Reoccurring tickets vs total tickets

• Customer impact score vs Resource impact score (Agile story point method)

• Detection Source that catches the most incidents (Alerts)
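Three of the comparisons above can be checked directly against the incident table. The sketch below assumes the relevant columns are transcribed by hand. `INCIDENTS` and its field order are illustrative choices, not something the slides define.

```python
from collections import Counter

# (detection source, time to resolve, total response time, reoccurring flag),
# one tuple per row of the incident table, in table order
INCIDENTS = [
    ("Alert", 10, 12, 0), ("Alert", 5, 7, 1), ("User", 3, 5, 0),
    ("User", 4, 6, 1), ("Alert", 6, 8, 1), ("IT Ops", 3, 5, 0),
    ("Alert", 2, 4, 0), ("Help Desk", 0, 2, 0), ("Alert", 6, 8, 0),
    ("Alert", 3, 5, 0), ("User", 4, 6, 1), ("User", 7, 9, 1),
    ("Alert", 0, 2, 0), ("IT Ops", 4, 6, 1), ("Help Desk", 5, 7, 1),
    ("Help Desk", 5, 7, 0), ("Alert", 3, 5, 0), ("Alert", 6, 8, 0),
    ("User", 6, 8, 0),
]

# Which detection source catches the most incidents?
source_counts = Counter(source for source, _ttr, _trt, _flag in INCIDENTS)
top_source, top_count = source_counts.most_common(1)[0]

# Difference between TRT and TTR, i.e. the time to detect, per incident
detect_gaps = {trt - ttr for _source, ttr, trt, _flag in INCIDENTS}

# Reoccurring tickets vs total tickets
reoccurring = sum(flag for *_rest, flag in INCIDENTS)
total = len(INCIDENTS)

print(top_source, top_count)
print(detect_gaps)
print(f"{reoccurring} of {total} tickets reoccurred")
```

In this sample the gap between TRT and TTR is the same for every incident, so any variation in total response time comes from resolution rather than detection.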

LET’S TAKE IT FOR A SPIN! Step 4: Application

HOW DO WE USE IT? Step 4: Application

Task No. | Action | Location Affected | Description | Event | Resource | Date | Incident Hours | Communication Score | Ideal Score
1 | Origin | Seattle | User 1 opens infected email attachment. | Event 1 | User | 8/26/16 1:00 PM | 0.00 | 0 | 100
2 | Detected | Seattle | Help Desk receives call from User 1 saying files are locked and ransom message displays. | Event 1 | User | 8/27/16 8:00 AM | 19.00 | 80 | 100
3 | Reported | Seattle | Help Desk notifies Desktop Team and ISO but not Ops. ISO does not notify Ops. | Event 1 | Help Desk | 8/27/16 9:00 AM | 20.00 | 60 | 100
4 | Contained | Seattle | Desktop Team reclaims infected laptop. | Event 1 | Desktop Team | 8/27/16 12:00 PM | 23.00 | 40 | 100
5 | Analyzed | Seattle | Desktop Team analyzes the laptop and begins restoration process. Desktop Team does not wait for guidance from ISO and does not notify Ops. | Event 1 | Desktop Team | 8/27/16 1:30 PM | 24.50 | 20 | 100
6 | Restored | Seattle | Desktop Team completes restore and returns laptop to user. Does not notify other teams. | Event 1 | Desktop Team | 8/27/16 3:30 PM | 26.50 | 10 | 100
7 | Detected | Seattle | User 2 calls Help Desk and reports files are locked in a shared folder. | Event 2 | User | 8/27/16 1:00 PM | 24.00 | 80 | 100
8 | Reported | Seattle | Help Desk notifies IT Ops to look at the shared folder. Unknown if ISO is notified. | Event 2 | Help Desk | 8/27/16 1:30 PM | 24.50 | 60 | 100
9 | Analyzed | Seattle | IT Ops analyzes and notes the files have all been encrypted and are inaccessible. Confirms incident with Help Desk. Does not notify Infrastructure. | Event 2 | IT Ops | 8/27/16 2:00 PM | 25.00 | 75 | 100
10 | Notified | Seattle | IT Ops attempts to call the ISO call tree for incidents but no one picks up. | Event 2 | IT Ops | 8/27/16 3:00 PM | 26.00 | 80 | 100
11 | Stalled | Seattle | IT Ops does not take action in case files need to be investigated. Files on shared folder continue to be encrypted. | Event 2 | IT Ops | 8/27/16 3:15 PM | 26.15 | 60 | 100
12 | Replied | Seattle | IT Ops receives word from ISO to hold the encrypted files as evidence. | Event 2 | IT Ops | 8/28/16 7:00 AM | 44.00 | 70 | 100
13 | Restored | Seattle | IT Ops receives word from business unit affected to restore the shared folder ASAP. IT Ops does not hold the files as evidence and fulfills the business want by restoring the shared folder from a backup. | Event 2 | IT Ops | 8/28/16 9:00 AM | 46.00 | 50 | 100
14 | Post-Analysis | Seattle | ISO confirms User 1 and User 2 incident are related. | Event 1 | ISO | 8/28/16 10:00 AM | 47.00 | 60 | 100
15 | Post-Analysis | Seattle | ISO creates timeline and reviews with involved teams. | Event 1 | ISO | 8/29/16 1:00 PM | 72.00 | 80 | 100
16 | Follow Up | Seattle | Detailed meetings are conducted but no follow up procedures are created. | Event 1 | ISO | 8/30/16 3:00 PM | 98.00 | 40 | 100
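The Incident Hours column reads as elapsed time since the origin event (Task 1, 8/26/16 1:00 PM). The sketch below shows one way to derive it from the timestamps, assuming decimal hours. The names `ORIGIN` and `incident_hours` are hypothetical. Recomputed values do not always match the table: 3:15 PM on 8/27 comes out as 26.25 decimal hours where the table shows 26.15, and Tasks 12 to 14 come out two hours lower than listed. The slides do not say whether the time or the hours figure is the intended one.

```python
from datetime import datetime

# Task 1 (Origin): 8/26/16 1:00 PM
ORIGIN = datetime(2016, 8, 26, 13, 0)

def incident_hours(when: datetime, origin: datetime = ORIGIN) -> float:
    """Elapsed decimal hours between the origin event and a later task."""
    return (when - origin).total_seconds() / 3600

task_2 = incident_hours(datetime(2016, 8, 27, 8, 0))     # Detected, 8:00 AM
task_11 = incident_hours(datetime(2016, 8, 27, 15, 15))  # Stalled, 3:15 PM
task_16 = incident_hours(datetime(2016, 8, 30, 15, 0))   # Follow Up, 3:00 PM

print(task_2, task_11, task_16)
```

Deriving the column from timestamps, rather than typing it in, keeps the timeline consistent when events are added or reordered.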

Analyze & Apply

Communications Metric

Communication score per team for the incident. Hold teams accountable for communication. Score is based on the Agile story point method.

Communication score during the incident timeline. Ideal line at 100. Goal: Get the communication score line closer to the ideal line!

Communication takes a dive once teams are analyzing and restoring! Now we know where to focus!
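A per-team view of the communication score can be built from the Resource and Communication Score columns of the timeline. The slides do not state how the per-team score is aggregated, so the sketch below assumes a simple mean per team and measures the gap to the ideal score of 100. `SCORES`, `mean_score_by_team` and the inclusion of the origin row are illustrative assumptions.

```python
from collections import defaultdict
from statistics import mean

IDEAL = 100

# (resource, communication score) for Tasks 1-16 of the timeline, in order
SCORES = [
    ("User", 0), ("User", 80), ("Help Desk", 60), ("Desktop Team", 40),
    ("Desktop Team", 20), ("Desktop Team", 10), ("User", 80), ("Help Desk", 60),
    ("IT Ops", 75), ("IT Ops", 80), ("IT Ops", 60), ("IT Ops", 70),
    ("IT Ops", 50), ("ISO", 60), ("ISO", 80), ("ISO", 40),
]

def mean_score_by_team(rows):
    """Average communication score per team (assumed aggregation)."""
    by_team = defaultdict(list)
    for team, score in rows:
        by_team[team].append(score)
    return {team: mean(values) for team, values in by_team.items()}

team_means = mean_score_by_team(SCORES)
gaps = {team: IDEAL - avg for team, avg in team_means.items()}
widest_gap_team = max(gaps, key=gaps.get)

for team, avg in sorted(team_means.items(), key=lambda item: item[1]):
    print(f"{team}: mean {avg:.1f}, gap to ideal {IDEAL - avg:.1f}")
print("Widest gap:", widest_gap_team)
```

Under this aggregation the Desktop Team, which handled the analyze and restore steps of Event 1, shows the widest gap, which is consistent with the observation that communication drops during analysis and restoration.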

How to Apply? Improve communication!

Tracker role
- Designated solely to stay updated on progress.
- Only person with direct communication to the team.

Communicator role
- Eliminates communication scramble, duplication, uncertainty and interference.
- Communicates to rest of org, external, stakeholders, etc.

• Let the Execution Team focus on work!!!!

• Scalable Pod = region

GO DANCE!

Step 1: Metrics
Step 2: Data
Step 3: Visualization
Step 4: Apply
