What to Do When It All Goes So Wrong

David Levy AdventuresInSql.com SQL Saturday #67 Chicago

Upload: david-levy

Post on 08-Jul-2015


DESCRIPTION

As IT professionals we will inevitably see situations where everything goes wrong. At times we are somewhat lucky and this just means diminished functionality or a slow system. Other times our organization is temporarily out of business. Regardless of the scope of the issue, how we react can have a direct impact on how quickly things return to normal. This session will cover how to communicate issues, including what to say, who to say it to and when to say it. Part of managing communication is getting everyone into a room and forcing them to talk, so time will be spent on designing an effective war room. The session will also cover how setting out to prove that an issue is ours gets us to a root cause more quickly.

TRANSCRIPT

David Levy

AdventuresInSql.com

SQL Saturday #67 Chicago

More than 11 years in IT

SQL Server DBA for over 3 years

Previous Life as Developer

Blogger

◦ http://adventuresinsql.com

◦ Syndicated on SQLServerCentral.com

◦ Syndicated on SQLServerPedia.com

@dave_levy on Twitter

Peak Time of Peak Sales Day

Typical Hourly Sales $100K/HR

Order Entry Screen is Locked Up

Users Report Slowness Initially

Now the “Sales Center” Application is Just “Clocking”

Let Everyone Know There is a Problem

◦ Prevent Duplicated Efforts

◦ Allows Others to Speak Up

Recent Changes

Related Issues


Send Up a Flare

◦ Send to an IT Only Distribution Group

◦ Keep the Subject Line General

◦ Provide Broad Overview Including:

Systems Impacted

Major Symptoms Including Error Messages

Number of People Impacted

Any Location Specific Information

What Resources Do You Need?

◦ Subject Matter Experts

◦ Specialized Equipment

Never Assign Blame

Only State Facts

To: IT Emergencies

Subject: Sales Center Issues

Sales Center Users are reporting that the Order Entry screen has quit responding. We are currently investigating the issue with the Sales Center Development Team. We will provide updates as we know more.
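The flare checklist above can be sketched as a small message builder. This is a minimal illustration using Python's standard `email.message.EmailMessage`; the function name `build_flare`, its parameters, and the address `it-emergencies@example.com` are assumptions for the example, not anything from the slides.

```python
from email.message import EmailMessage


def build_flare(systems, symptoms, people_impacted, locations, resources_needed):
    """Build a facts-only flare e-mail for an IT-only distribution group."""
    msg = EmailMessage()
    # Hypothetical address; the slide only names the group "IT Emergencies".
    msg["To"] = "it-emergencies@example.com"
    # Keep the subject general: name the system, not a suspected cause.
    msg["Subject"] = f"{systems[0]} Issues"
    lines = [
        f"Systems impacted: {', '.join(systems)}",
        f"Major symptoms: {'; '.join(symptoms)}",
        f"People impacted: {people_impacted}",
        f"Locations: {', '.join(locations) or 'unknown'}",
        f"Resources needed: {', '.join(resources_needed) or 'none yet'}",
        "We will provide updates as we know more.",
    ]
    msg.set_content("\n".join(lines))
    return msg


flare = build_flare(
    systems=["Sales Center"],
    symptoms=["Order Entry screen has quit responding"],
    people_impacted="not yet confirmed",
    locations=[],
    resources_needed=["Sales Center Development Team"],
)
print(flare)
```

Every field holds an observed fact or an explicit "unknown", which keeps blame and guesses out of the message.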

Collect

Process

Respond

What Are the Symptoms?

What Locations are Involved?

What Systems are Involved?

◦ SQL Server

◦ AS400

◦ Mainframe

◦ Web Farm

◦ Major Network Components like Load Balancers

What Has Changed?

◦ Look at Change Control Calendar

◦ Talk to Primary On-Calls for Related Systems

Anything in the Logs?

◦ Windows Logs

◦ Application Specific Logs

◦ Custom Exception Handling Systems

What are Performance Indicators Showing?

◦ Perfmon

◦ SQL Wait Stats

◦ Third-party tools
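For the SQL wait stats item, one common approach (an assumption here, not something the slides spell out) is to take two snapshots of SQL Server's `sys.dm_os_wait_stats` view and compare them, because its counters are cumulative. The sketch below works on plain dictionaries so it runs without a database; the function name `wait_stat_deltas` and the sample numbers are made up for illustration.

```python
def wait_stat_deltas(before, after, top=3):
    """Rank wait types by wait time accumulated between two snapshots.

    Each snapshot maps wait_type -> cumulative wait_time_ms, the shape of
    data that sys.dm_os_wait_stats reports.
    """
    deltas = {
        wait_type: after_ms - before.get(wait_type, 0)
        for wait_type, after_ms in after.items()
    }
    ranked = sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)
    # Keep only wait types that actually grew during the interval.
    return [(w, ms) for w, ms in ranked if ms > 0][:top]


# Made-up numbers for illustration only.
before = {"PAGEIOLATCH_SH": 120_000, "LCK_M_X": 5_000, "CXPACKET": 40_000}
after = {"PAGEIOLATCH_SH": 125_000, "LCK_M_X": 95_000, "CXPACKET": 41_000}

top_waits = wait_stat_deltas(before, after)
for wait_type, ms in top_waits:
    print(f"{wait_type}: +{ms} ms")
```

In this invented sample the lock wait dominates the interval, which would point the investigation toward blocking rather than disk.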

Analyze Collected Information

◦ Are There Any Obvious Signs of Trouble?

◦ Can the Problem be Linked to a Change?

◦ Can Any Patterns be Identified?

Prove It Is Your Issue

◦ Shows Humility

◦ Shows Respect for Everyone Else’s Time

◦ Avoid Appearing Arrogant

Prove It Is Your Issue

◦ Construct Tests to Prove Theories in Order of Likelihood Until Problem Proven or Theories Exhausted

Faster than arguing about what it is not

How can you know it is not your issue?

List Potential Actions

◦ Rank by effort, confidence, level of risk

◦ Develop action plans for best options and re-rank

◦ Each potential action should have a rollback plan
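Ranking by effort, confidence and risk can be sketched as a simple score. The 1 to 5 scales, the equal weighting, and the sample actions below are all assumptions for illustration; the slides only say which factors to rank by and that each action needs a rollback plan.

```python
from dataclasses import dataclass


@dataclass
class Action:
    name: str
    effort: int         # 1 (minutes) .. 5 (hours)
    confidence: int     # 1 (a guess) .. 5 (near certain)
    risk: int           # 1 (harmless) .. 5 (could make things worse)
    rollback: str = ""  # how to undo the change; empty means no plan yet


def rank_actions(actions):
    """Order actions best-first, skipping any without a rollback plan."""
    ready = [a for a in actions if a.rollback]
    # Hypothetical scoring: reward confidence, penalize effort and risk equally.
    return sorted(ready, key=lambda a: a.confidence - a.effort - a.risk, reverse=True)


candidates = [
    Action("Restart application service", 1, 2, 3, "Service restarts cleanly on its own"),
    Action("Add missing index", 2, 4, 2, "Drop the index"),
    Action("Back out weekend patch", 3, 4, 2, "Reapply the patch"),
    Action("Fail over to standby", 4, 3, 4),  # no rollback plan, so not ranked
]

for action in rank_actions(candidates):
    print(action.name)
```

After action plans are developed, the scores can be updated and the list re-ranked by calling `rank_actions` again.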

Define Measures

◦ What will indicate things have gotten better?

Adding this index will reduce Disk IO by 10 million reads per second

The execution time of query x will drop from 6 minutes to 50 milliseconds

Define Measures

◦ What will indicate things have gotten worse?

Disk IO may go up

The execution time of query x may go up

Adding this index may slow inserts from the order upload process
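The "better" and "worse" measures can be written down before the change and checked afterwards. The following is a minimal sketch under assumed names (`evaluate_change`, the metric keys); the query timing uses the slide's figures of 6 minutes and 50 milliseconds, while the insert timings are invented.

```python
def evaluate_change(before, after, expect_lower, watch):
    """Compare metrics captured before and after one change.

    expect_lower: metrics the change is supposed to reduce.
    watch: metrics that may get worse as a side effect.
    """
    improved = [m for m in expect_lower if after[m] < before[m]]
    regressed = [m for m in watch if after[m] > before[m]]
    return {"improved": improved, "regressed": regressed}


# Query x uses the slide's figures (6 minutes -> 50 ms); insert timings are made up.
before = {"query_x_ms": 360_000, "order_upload_insert_ms": 12}
after = {"query_x_ms": 50, "order_upload_insert_ms": 15}

result = evaluate_change(
    before,
    after,
    expect_lower=["query_x_ms"],
    watch=["order_upload_insert_ms"],
)
print(result)
```

Deciding both lists up front makes it harder to rationalize a change after the fact, and a non-empty `regressed` list is the cue to weigh the rollback plan.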

Communicate Your Intentions

Make the Change

◦ Follow a written plan

◦ Make a single change

◦ A single person should make the change

◦ Document any additional steps taken

Start Over by Collecting More Data

Signs You Need to Convene A War Room

◦ Having Trouble Finding Anything Wrong

◦ 30 Minutes Without Progress

◦ An Issue Appears to Span Multiple Systems

◦ Having Difficulty Getting People Engaged
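These four signs can be expressed as a small checklist function. It is only a sketch: the function name `war_room_reasons`, its parameters, and the timestamps are assumptions chosen for the example.

```python
from datetime import datetime, timedelta


def war_room_reasons(last_progress, now, systems_involved,
                     found_anything_wrong, people_engaged):
    """Return the warning signs from the list above that currently apply."""
    reasons = []
    if not found_anything_wrong:
        reasons.append("having trouble finding anything wrong")
    if now - last_progress >= timedelta(minutes=30):
        reasons.append("30 minutes without progress")
    if len(systems_involved) > 1:
        reasons.append("issue appears to span multiple systems")
    if not people_engaged:
        reasons.append("having difficulty getting people engaged")
    return reasons


# Arbitrary timestamps: last progress was 35 minutes ago.
reasons = war_room_reasons(
    last_progress=datetime(2024, 1, 1, 10, 10),
    now=datetime(2024, 1, 1, 10, 45),
    systems_involved=["SQL Server", "Web Farm"],
    found_anything_wrong=True,
    people_engaged=True,
)
print(reasons)
```

Any non-empty result is a reason to convene. In the scenario from the opening slides, with typical sales of $100K per hour, 30 minutes without progress corresponds to roughly $50K of sales at risk.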

Get Everyone in a Room

No Changes Made Outside the Room

No Heroes

◦ Watch out for people doing a lot of typing

◦ Avoid changes that take more than a few minutes

Have a Call-In Number for Remote Coworkers

Have a Technology Kit

◦ Old Switch

◦ Patch Cords

◦ Mice + Mouse Pads

◦ Power Strips

Monitor Your Guest List

◦ 1-2 Representatives From Each Team

◦ Try to Keep Management Out

◦ Watch for Disruptive People

To: IT Emergencies

Subject: Sales Center Issues

We are convening a war room for the Sales Center issue. Everyone working on the issue, please meet in the North Conference Room. Remote/WFH coworkers should dial into the conference bridge at 888-888-1234, participant code: 1234.

Collect

Process

Respond

White Board the Issue

◦ Every System Gets Own Column

◦ Write All Facts on White Board

◦ Closed Items Get Crossed Out Not Erased

◦ Include a Resolution for Each Closed Item
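The whiteboard rules above map onto a small data structure: one column per system, items that are never deleted, and a resolution required to close an item. The class and method names below are hypothetical, and the sample facts are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Item:
    fact: str
    resolution: str = ""  # empty while the item is still open

    @property
    def closed(self):
        return bool(self.resolution)


class Whiteboard:
    def __init__(self, systems):
        # Every system gets its own column.
        self.columns = {system: [] for system in systems}

    def add(self, system, fact):
        item = Item(fact)
        self.columns[system].append(item)
        return item

    def close(self, item, resolution):
        # Closed items stay on the board and must carry a resolution.
        if not resolution:
            raise ValueError("a closed item needs a resolution")
        item.resolution = resolution

    def render(self):
        lines = []
        for system, items in self.columns.items():
            lines.append(system)
            for item in items:
                if item.closed:
                    lines.append(f"  [x] {item.fact} -> {item.resolution}")
                else:
                    lines.append(f"  [ ] {item.fact}")
        return "\n".join(lines)


board = Whiteboard(["SQL Server", "Web Farm"])
blocking = board.add("SQL Server", "Blocking on order table")
board.add("Web Farm", "Timeouts in application log")
board.close(blocking, "Ruled out: blocking cleared, screen still locked")
print(board.render())
```

Because there is no delete method, the rendered board doubles as the record to capture when the war room closes out.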

Share the Floor

◦ Likely Issue Owner Has the Lead

◦ Make Sure Everyone is Heard

◦ Contributing Often Involves Staying Out of the Way

◦ Don’t Be Afraid to Fade Back and Run The Whiteboard

Never Call “Not-It” and Leave

◦ Not Helpful

◦ You May be Wrong

◦ Appears Arrogant

Keep an Eye On Time

◦ Provide Regular Updates to Management

◦ Bring in Food Around Meal Times

Raises Spirits

Brings in More People to Help

To: IT Emergencies

Subject: Sales Center Issues Update

The Sales Center war room is still going. We are currently looking into a driver issue with IBM. All necessary resources have been engaged.

Keep People in Reserve

◦ Each Team Should Divide up the Day

◦ Rotate People In and Out

◦ Send Someone Home Early to Come in Early

Closing Out

◦ Communicate Resolution

◦ Capture Contents of Whiteboard

◦ Clean Up Room

To: IT Emergencies

Subject: Sales Center Issues Resolved

The Sales Center issue has been resolved. The issue was caused by a patch that was applied over the weekend. Now that it has been backed out everything has returned to normal.

Questions?