Five 9s for SANs w/o Breaking the Bank Presented by Marc Staimer President & CDS (Chief Dragon Slayer) Dragon Slayer Consulting

Post on 14-Dec-2015


Agenda

What is Five 9s?

How this relates to SANs

Reality Check

What you should do

What is Five 9s & What does it Really Mean?

[Chart: Scheduled or Unscheduled Downtime Minutes (axis 0 to 12,000) by availability level: 98.000%, 99.000%, 99.900%, 99.990%, 99.999%]

Five 9s Generally Defined

99.999% is another term for “High Availability”

What does “Availability” mean?

Availability is the proportion of time that a system can be used for productive work

Then what does “five 9s” mean?

Scheduled & unscheduled downtime does not exceed ~5 minutes per year

Perspective: Annual downtime =

• Less time than it takes to drink a cup of coffee

• 1/6th the time of the average daily commute

What about Four 9s or less?

Four 9s = ~ an hour of downtime/yr

Three 9s = ~ 9 hours of downtime/yr

Two 9s = ~ 4 days (88 hours) of downtime/yr
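The downtime figures above follow directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year; the function name is illustrative, not from the presentation:

```python
# Assumption: a 365-day year (leap years ignored).
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Yearly downtime budget (scheduled plus unscheduled) at a given availability."""
    return (1 - availability_percent / 100) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for pct in (99.0, 99.9, 99.99, 99.999):
        minutes = downtime_minutes_per_year(pct)
        print(f"{pct}% available: {minutes:.1f} min/yr ({minutes / 60:.2f} h)")
```

Two 9s works out to 87.6 hours, which the slide rounds to 88 hours, and five 9s to roughly 5.3 minutes.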

Can you live with two, three, or four 9s? …It depends

On the application

The types of outages you can live with

The cost of downtime for those applications

The cost of high availability such as five 9s

Application Availability Dependencies

Mission criticalness

Productivity loss from downtime

Alternatives

Outage dependencies

You may be able to live w/two 9s if:

• There are 88 separate outages of 1 hour each through the year

It is a different story if it is 1 outage lasting nearly 4 days

• This could put a business out of business

Cost of downtime

The cost of app downtime can be prohibitive

Direct costs of downtime, per Gartner Group

Industry                      Average Loss/Hr.
Brokerage Operations          $6,450,000
Credit Card Authorizations    $2,600,000
E-commerce                    $240,000
Package Shipping Services     $150,250
Home Shopping Channels        $113,750
Catalog Sales Center          $90,000
Airline Reservation Center    $89,500
Cellular Service Activation   $41,000
ATM Service Fees              $14,500
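Combining these hourly figures with an availability level gives a rough annual exposure. The sketch below assumes the loss is linear in downtime and that the whole downtime budget is actually used; both are simplifying assumptions, and the function name is illustrative:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; leap years ignored

# Average loss per hour for three of the industries listed (Gartner Group figures).
LOSS_PER_HOUR = {
    "Brokerage Operations": 6_450_000,
    "E-commerce": 240_000,
    "ATM Service Fees": 14_500,
}


def annual_downtime_cost(loss_per_hour: float, availability_percent: float) -> float:
    """Direct yearly loss if the full downtime budget is consumed (assumed linear)."""
    downtime_hours = (1 - availability_percent / 100) * HOURS_PER_YEAR
    return loss_per_hour * downtime_hours


if __name__ == "__main__":
    for industry, loss in LOSS_PER_HOUR.items():
        for pct in (99.9, 99.999):
            cost = annual_downtime_cost(loss, pct)
            print(f"{industry} at {pct}%: ${cost:,.0f}/yr")
```

This covers direct cost only; collateral damage is not modeled.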

Collateral damage of downtime is more, per Gartner Group

Company   Direct Cost      Collateral Damage
eBay      > $5,000,000     Dramatic mkt cap reduction
ATT       > $10,000,000    ~$40 million in rebates + SLAs

Collateral damage is more serious than temporary loss of business

Collateral damage severity increases as business moves online

Making “availability” five 9s has a cost too

Old rule of thumb:

1st 80%

• 20% of Cost

Last 20%

• 80% of Cost

Per IMEX Research

There must be tradeoffs

Per IMEX Research

Finding the crossover point is key

[Chart: System Cost and Annual Business Downtime Cost versus Percent Available (90%, 99%, 99.90%, 99.99%, 99.999%, 100%); regions labeled “Excessive Downtime Costs” and “Excessive System Costs”; marker labeled “System Uptime Requirements”]
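One way to look for the crossover point is to add system cost and downtime cost at each availability tier and pick the cheapest total. This is only a sketch: the tier costs in `EXAMPLE_TIERS` are hypothetical numbers invented for illustration, not IMEX Research data, and the linear downtime-cost model is an assumption.

```python
HOURS_PER_YEAR = 365 * 24  # leap years ignored


def total_annual_cost(system_cost: float, loss_per_hour: float,
                      availability_percent: float) -> float:
    """System cost plus downtime cost, assuming the full downtime budget is used."""
    downtime_hours = (1 - availability_percent / 100) * HOURS_PER_YEAR
    return system_cost + loss_per_hour * downtime_hours


def cheapest_tier(tiers: dict, loss_per_hour: float) -> float:
    """Return the availability percent with the lowest total annual cost.

    `tiers` maps availability percent -> annual system cost.
    """
    return min(tiers, key=lambda pct: total_annual_cost(tiers[pct], loss_per_hour, pct))


# HYPOTHETICAL annual system costs per tier, for illustration only.
EXAMPLE_TIERS = {99.0: 100_000, 99.9: 200_000, 99.99: 500_000, 99.999: 1_500_000}

if __name__ == "__main__":
    for loss in (14_500, 6_450_000):
        print(f"${loss:,}/hr -> target {cheapest_tier(EXAMPLE_TIERS, loss)}%")
```

With these made-up tier costs, a low hourly loss lands on three 9s while a very high hourly loss justifies five 9s, which is the tradeoff the chart describes.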

How: Thorough Environment Knowledge

Systems

Hardware

Software

Data

Productivity

Direct cost of downtime and collateral damage

What about disasters & downtime? Not if, when

There will eventually be a major interruption of your business environment

Test, test, test

Whatever your business continuity plans, make sure you can recover your business in the event of a failure

Test, test, test

• One end-user claims to back up to tape every month, except he backs up onto the same tape every time, even when the system asks for a new tape

Reasons cited by European Enterprises for invocation of Business Continuity Plans

From 1997-2000

Hardware Failure    60%
Software            16%
Power Outage         7%
Bomb                 3%
Fire                 3%
Flooding             3%
Environmental        2%
Telecom Failure      1%
Denied Access        1%
Miscellaneous        4%

Reasons cited by USA Enterprises for invocation of Business Continuity Plans

From 1997-2000

Regional Event      40%
Hardware Failure    36%
Software            10%
Power Outage         4%
Bomb                 2%
Fire                 2%
Flooding             2%
Environmental        1%
Telecom Failure      1%
Denied Access        1%
Miscellaneous        1%

How does all this relate to SANs?

SANs have become the critical path of “high availability” or five 9s.

When an application server fails

• Only the users using that app are affected

When shared storage goes down

• Users of the applications using that storage are affected

When the SAN goes down

• All users are affected
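The widening impact can be sketched as a simple dependency lookup. The topology below (app names, servers, arrays, user counts) is entirely hypothetical, and users are counted once per app they use:

```python
# HYPOTHETICAL topology, invented for illustration only.
APPS = {
    "email":   {"server": "srv1", "storage": "array_a", "users": 500},
    "billing": {"server": "srv2", "storage": "array_a", "users": 50},
    "web":     {"server": "srv3", "storage": "array_b", "users": 2000},
}


def users_affected(failed_component: str, apps: dict = APPS, san: str = "san") -> int:
    """Sum the users of every app that depends on the failed component.

    Every app is assumed to reach its storage through the one shared SAN.
    """
    total = 0
    for app in apps.values():
        depends_on = {app["server"], app["storage"], san}
        if failed_component in depends_on:
            total += app["users"]
    return total


if __name__ == "__main__":
    for component in ("srv1", "array_a", "san"):
        print(f"{component} down: {users_affected(component)} users affected")
```

In this made-up example a server failure touches one app, a storage failure touches the apps sharing that array, and a SAN failure touches everything.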

Complete availability vs. high availability w/reduced capabilities

Five 9s w/no loss of capabilities

• Full Bandwidth all the time w/no pr

Five 9s w/reduced capabilities

• Reduced Bandwidth

• Higher probability of path congestion

Similar to differences between RAID 0,1 & RAID 5

Five 9s SANs with full capabilities

Director class switches

Full bandwidth between

• Initiators & target storage

• Even with a failure in the Director or fabric

Five 9s SANs with reduced capabilities

Core/edge networking

Oversubscribed B/W

Path failures mean

• Auto failover

• Reduced B/W

• Increased possibilities of congestion
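Why a path failure raises the chance of congestion can be seen from the oversubscription ratio of an edge switch, taken here as device-facing ports divided by inter-switch link (ISL) ports. The sketch assumes all ports run at the same speed; the 12/4 port split on a 16-port switch is a hypothetical example, not a figure from the presentation:

```python
from fractions import Fraction


def oversubscription(device_ports: int, isl_ports: int) -> Fraction:
    """Device-facing ports per ISL port on one edge switch (equal port speeds assumed)."""
    if isl_ports <= 0:
        raise ValueError("no ISLs left: the edge switch is cut off from the core")
    return Fraction(device_ports, isl_ports)


def after_isl_failures(device_ports: int, isl_ports: int, failed: int) -> Fraction:
    """Oversubscription once `failed` ISLs have dropped out and traffic has failed over."""
    return oversubscription(device_ports, isl_ports - failed)


if __name__ == "__main__":
    # HYPOTHETICAL split of a 16-port edge switch: 12 device ports, 4 ISLs.
    print("normal:", oversubscription(12, 4))
    print("one ISL down:", after_isl_failures(12, 4, 1))
```

The fabric stays up after the failure, but the same device traffic now shares fewer links.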

Fabric Comparison or Red Herring?

96 Port Resilient Core/Edge Fabric

128 Port Fault Tolerant Director Fabric or 128 Port Dual 64 Port Directors

Using 16-port switches

[Diagram labels: Core Switch, Edge Switch, Core, Edge]

Directors vs. Core/Edge Switches

Directors - five 9s fully capable

Cost ~ $2,500/port

Mask failures

• Apps never know it fails

Full B/W even with failures

Simple to set up & manage

Fault tolerant

Network: up to 239 switches/directors

Up to 256 ports/director

• Can be Core or Edge switch

Switches - five 9s, w/reduced failure mode capabilities

Cost ~ $1,000/port

Oversubscribed B/W

• Congestion statistically unlikely

Failures mean loss of B/W

More difficult to set up/manage

Fault resilient

Network: up to 239 switches

Up to 64 ports/switch

• Can be Core or Edge Switch
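A rough purchase-cost comparison from the per-port figures above. The slide does not say whether the ~$1,000/port switch price applies to ISL ports as well as usable ports, so this sketch assumes every physical port is paid for; the ISL port count in the example is hypothetical:

```python
DIRECTOR_COST_PER_PORT = 2_500  # approximate figure from the slide
SWITCH_COST_PER_PORT = 1_000    # approximate figure from the slide


def fabric_cost(usable_ports: int, cost_per_port: int, isl_ports: int = 0) -> int:
    """Purchase cost, assuming every physical port (usable or ISL) is paid for."""
    return (usable_ports + isl_ports) * cost_per_port


if __name__ == "__main__":
    director = fabric_cost(128, DIRECTOR_COST_PER_PORT)
    # HYPOTHETICAL: 96 usable ports plus 32 ports consumed by ISLs.
    core_edge = fabric_cost(96, SWITCH_COST_PER_PORT, isl_ports=32)
    print(f"128-port director fabric:  ${director:,}")
    print(f"96-port core/edge fabric:  ${core_edge:,}")
    print(f"core/edge per usable port: ${core_edge / 96:,.0f}")
```

Ports consumed by ISLs raise the effective cost per usable port of a core/edge fabric, which narrows the gap somewhat without closing it.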

Reality Check

Core/edge & Directors are not mutually exclusive

• Models can & should be mixed

Some apps cannot handle fabric disruptions of any kind

Some fabrics can never ever have reduced capacity

Some apps do not have to have full B/W all the time

Fabric Design “five 9s” Factors

The larger the switch/director nodes

• The less likely there will be inter-switch/director traffic

• The more oversubscribed your fabric can be w/o increased risk

• The more important “HA” becomes in the node itself

FSPF has limited failover capabilities

• The loss of a path in the fabric (ISL failure) will cause failover

• Failover may not be fast enough to avoid SCSI device timeout

• Edge device retransmissions or failover must be designed in

The key is determining where to implement, with what, & when

Use the same ROE as before

Thorough knowledge of the data & environment

• Hardware, software, systems, etc.

Match the type of SAN to the application

What you should do

Educate yourself about your data & environment

Design your SANs to meet the needs of the business

• Provide five 9s with full capability for those apps that need it

• Provide five 9s with less than full capability for those apps that don’t need it

Making your entire SAN environment completely five 9s w/no loss of capabilities could be cost prohibitive

SAN Design Methodology

[Diagram: phases labeled Design, Implementation, Maint.; steps labeled Data Collection, Data Analysis, Arch Develop, Prototype and Test, Transition, Release to Production, Add / Change / Remove / Mgt / Troubleshoot; also labeled Upgrade / Architectural change]

Other tools you can use

Interactive online high availability interrogator

• Helps determine the cost of your downtime

White papers

• http://www.available.com