Five 9s for SANs w/o Breaking the Bank
Presented by Marc Staimer, President & CDS (Chief Dragon Slayer), Dragon Slayer Consulting
What is Five 9s & What does it Really Mean?
[Chart: Scheduled or Unscheduled Downtime Minutes (axis 0 to 12,000) at 98.000%, 99.000%, 99.900%, 99.990%, and 99.999% availability]
What does “Availability” mean?
Availability is the proportion of time that
a system can be used for productive work
Then what does “five 9s” mean?
Scheduled & Unscheduled downtime does
not exceed ~ 5 minutes per year
Perspective: Annual downtime =
• Less time than it takes to drink a cup of coffee
• 1/6th the time of the average daily commute
What about Four 9s or less?
Four 9s = ~ an hour of downtime/yr
Three 9s = ~ 9 hours of downtime/yr
Two 9s = ~ 4 days (88 Hours) of
downtime/yr
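The downtime figures above follow from simple arithmetic on the availability percentage. A minimal sketch, assuming a 365-day year and counting scheduled and unscheduled downtime together; the function name is illustrative and not from the presentation:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored (assumption)


def downtime_minutes_per_year(availability_percent: float) -> float:
    """Return the yearly downtime budget, in minutes, for an availability level."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be a percentage between 0 and 100")
    return MINUTES_PER_YEAR * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    levels = [("Two 9s", 99.0), ("Three 9s", 99.9),
              ("Four 9s", 99.99), ("Five 9s", 99.999)]
    for label, pct in levels:
        minutes = downtime_minutes_per_year(pct)
        print(f"{label:9s} {pct:7.3f}%  {minutes:8.1f} min/yr  ({minutes / 60:6.2f} h)")
```

Two 9s works out to 87.6 hours, which the slide rounds to 88 hours or about 4 days.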
Can you live with two, three, or four 9s? …It depends
On the Application
The types of outages you can live with
The cost of downtime for those
applications
The cost of high availability, such as five 9s
Application Availability Dependencies
Mission criticalness
Productivity loss from downtime
Alternatives
Outage dependencies
You may be able to live w/two 9s if:
• There are 88 separate outages of 1 hour each through the year
It is a different story if it is 1 outage of nearly 4 days
• This could put a business out of business
Direct costs of downtime, per Gartner Group
Industry                      Average Loss/Hr.
Brokerage Operations          $6,450,000
Credit Card Authorizations    $2,600,000
E-commerce                    $240,000
Package Shipping Services     $150,250
Home Shopping Channels        $113,750
Catalog Sales Center          $90,000
Airline Reservation Center    $89,500
Cellular Service Activation   $41,000
ATM Service Fees              $14,500
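To turn an hourly loss into a yearly exposure, multiply it by the downtime hours each availability level allows. A rough sketch, assuming the average loss applies to every hour of downtime (a simplification) and a 365-day year; the hourly figures are copied from the table above, and the function name is illustrative:

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; leap years ignored (assumption)

# Average loss per hour, copied from the Gartner Group table above.
LOSS_PER_HOUR = {
    "Brokerage Operations": 6_450_000,
    "Credit Card Authorizations": 2_600_000,
    "E-commerce": 240_000,
    "ATM Service Fees": 14_500,
}


def annual_downtime_cost(loss_per_hour: float, availability_percent: float) -> float:
    """Direct cost of a year's worth of allowed downtime at a given availability."""
    downtime_hours = HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0
    return loss_per_hour * downtime_hours


if __name__ == "__main__":
    for industry, loss in LOSS_PER_HOUR.items():
        for pct in (99.0, 99.9, 99.99, 99.999):
            cost = annual_downtime_cost(loss, pct)
            print(f"{industry:28s} {pct:7.3f}%  ${cost:,.0f}/yr")
```

These are direct costs only; collateral damage is not included.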
Collateral damage of downtime is more, per Gartner Group
Company   Direct Cost      Collateral Damage
eBay      > $5,000,000     Dramatic mkt cap reduction
AT&T      > $10,000,000    ~$40 million in rebates + SLAs
Collateral damage is more serious than temporary loss of business
Collateral damage severity increases as business moves online
Making “availability” five 9s has a cost too
Old rule of thumb:
1st 80%
• 20% of Cost
Last 20%
• 80% of Cost
Per IMEX Research
Finding the crossover point is key
[Chart: System Cost vs. Percent Available (90%, 99%, 99.90%, 99.99%, 99.999%, 100%). Labels: Excessive Downtime Costs, Excessive System Costs, System Uptime Requirements, Annual Business Downtime Cost]
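One way to look for the crossover is to add the system cost and the downtime cost at each availability level and pick the level with the lowest total. A minimal sketch: the system cost figures below are hypothetical placeholders, not numbers from the presentation, and the function names are illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760; leap years ignored (assumption)

# HYPOTHETICAL annual system cost per availability level.
# Replace with figures from your own environment.
SYSTEM_COST = {
    99.0: 100_000,
    99.9: 250_000,
    99.99: 600_000,
    99.999: 2_000_000,
}


def total_annual_cost(availability_percent: float, loss_per_hour: float,
                      system_cost: float) -> float:
    """System cost plus the direct cost of the downtime that level still allows."""
    downtime_hours = HOURS_PER_YEAR * (100.0 - availability_percent) / 100.0
    return system_cost + loss_per_hour * downtime_hours


def cheapest_level(loss_per_hour: float, system_costs=SYSTEM_COST) -> float:
    """Return the availability level with the lowest combined annual cost."""
    return min(
        system_costs,
        key=lambda pct: total_annual_cost(pct, loss_per_hour, system_costs[pct]),
    )


if __name__ == "__main__":
    for loss in (14_500, 240_000, 6_450_000):
        print(f"loss/hr ${loss:>9,}: cheapest level is {cheapest_level(loss)}%")
```

With these placeholder costs, a low hourly loss settles on three 9s while a very high hourly loss justifies five 9s, which is the point of the chart: the right level depends on the application.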
How: Thorough Environment Knowledge
Systems
Hardware
Software
Data
Productivity
Direct cost of downtime and collateral
damage
What about disasters & downtime? Not if, when
There will eventually be a major
interruption of your business
environment
Test, test, test
Whatever your business continuity plans
Make sure you can recover your business
in the event of a failure
Test, test, test
• One end-user claims to back up to tape every month, except he backs up onto the same tape every time, even when the system asks for a new tape
Reasons cited by European Enterprises for invocation of Business Continuity Plans
From 1997-2000
Hardware Failure   60%
Software           16%
Power Outage       7%
Bomb               3%
Fire               3%
Flooding           3%
Environmental      2%
Telecom Failure    1%
Denied Access      1%
Miscellaneous      4%
Reasons cited by USA Enterprises for invocation of Business Continuity Plans
From 1997-2000
Regional Event     40%
Hardware Failure   36%
Software           10%
Power Outage       4%
Bomb               2%
Fire               2%
Flooding           2%
Environmental      1%
Telecom Failure    1%
Denied Access      1%
Miscellaneous      1%
SANs have become the critical path of “high availability” or five 9s.
When an application server fails
• Only the users using that app are affected
When shared storage goes down
• Users of the applications using that storage are affected
When the SAN goes down
• All users are affected
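Because every application depends on its server, its storage, and the SAN together, the availability a user sees is roughly the product of the three. A small sketch, assuming the components fail independently (a simplification); the availability values are hypothetical:

```python
from math import prod


def series_availability(*component_availabilities: float) -> float:
    """Availability of a chain in which every component must be up.

    Each argument is a fraction between 0 and 1 (e.g. 0.99999 for five 9s).
    Assumes independent failures.
    """
    for value in component_availabilities:
        if not 0.0 <= value <= 1.0:
            raise ValueError("each availability must be between 0 and 1")
    return prod(component_availabilities)


if __name__ == "__main__":
    # Hypothetical values: server, shared storage, SAN.
    all_five_nines = series_availability(0.99999, 0.99999, 0.99999)
    weak_san = series_availability(0.99999, 0.99999, 0.999)
    print(f"five 9s everywhere: {all_five_nines:.6f}")
    print(f"three 9s SAN:       {weak_san:.6f}")
```

The chain is never better than its weakest component, so a SAN below five 9s caps every application that runs over it.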
Complete availability vs. high availability w/reduced capabilities
Five 9s w/no loss of capabilities
• Full Bandwidth all the time w/no pr
Five 9s w/reduced capabilities
• Reduced Bandwidth
• Higher probability of path congestion
Similar to differences between RAID 0,1 & RAID 5
Five 9s SANs with full capabilities
Director class switches
Full bandwidth between
• Initiators & target storage
• Even with a failure in the Director or fabric
Five 9s SANs with reduced capabilities
Core/edge networking
Oversubscribed B/W
Path failures mean
• Auto failover
• Reduced B/W
• Increased possibilities of congestion
Fabric Comparison or Red Herring?
96 Port Resilient Core/Edge Fabric
• Using 16-port switches
128 Port Fault Tolerant Director Fabric, or 128 Port Dual 64 Port Directors
[Diagram labels: Core Switch, Edge Switch]
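The oversubscription and the bandwidth lost on a failure can be estimated from the port counts. The sketch below assumes one layout that yields 96 device ports from 16-port switches: 2 core switches and 8 edge switches, each edge using 2 ISLs to each core. The slide does not state the actual topology, so the layout, class name, and parameters are assumptions, and all ports are assumed to run at the same speed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreEdgeFabric:
    """Hypothetical core/edge layout; defaults are assumptions, not slide data."""
    ports_per_switch: int = 16
    core_switches: int = 2
    edge_switches: int = 8
    isls_per_edge_per_core: int = 2

    def __post_init__(self) -> None:
        # Every edge ISL must land on a core port.
        if self.edge_switches * self.isls_per_edge_per_core > self.ports_per_switch:
            raise ValueError("core switches do not have enough ports for the ISLs")
        if self.isls_per_edge >= self.ports_per_switch:
            raise ValueError("edge switches have no ports left for devices")

    @property
    def isls_per_edge(self) -> int:
        return self.isls_per_edge_per_core * self.core_switches

    @property
    def device_ports_per_edge(self) -> int:
        return self.ports_per_switch - self.isls_per_edge

    @property
    def device_ports(self) -> int:
        return self.device_ports_per_edge * self.edge_switches

    def oversubscription(self, failed_cores: int = 0) -> float:
        """Device ports per live ISL on one edge switch."""
        surviving = self.core_switches - failed_cores
        if surviving <= 0:
            raise ValueError("no surviving core switch: the fabric is down")
        return self.device_ports_per_edge / (self.isls_per_edge_per_core * surviving)


if __name__ == "__main__":
    fabric = CoreEdgeFabric()
    print("device ports:", fabric.device_ports)
    print("oversubscription, healthy:", fabric.oversubscription())
    print("oversubscription, one core down:", fabric.oversubscription(failed_cores=1))
```

Under these assumptions the fabric stays up when a core switch fails, but each edge switch loses half of its ISL bandwidth, so the oversubscription doubles.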
Directors vs. Core/Edge Switches
Directors - five 9s fully capable
Cost ~ $2,500/port
Mask failures
• Apps never know it fails
Full B/W even with failures
Simple to set up & manage
Fault tolerant
Network: up to 239 switches/directors
Up to 256 ports/director
• Can be Core or Edge switch
Switches - five 9s, w/reduced failure mode capabilities
Cost ~ $1,000/port
Oversubscribed B/W
• Congestion statistically unlikely
Failures mean loss of B/W
More difficult to set up/manage
Fault resilient
Network: up to 239 switches
Up to 64 ports/switch
• Can be Core or Edge Switch
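The per-port prices look further apart than they may be in practice, because ports used for ISLs are paid for but cannot attach devices. A rough comparison, assuming the core/edge fabric is built from ten 16-port switches (160 purchased ports, 96 usable) and that all 128 director ports are usable; both port counts are assumptions, and only the per-port prices come from the slide:

```python
def cost_per_usable_port(purchased_ports: int, usable_ports: int,
                         price_per_port: float) -> float:
    """Total purchase cost divided by the ports left for servers and storage."""
    if usable_ports <= 0 or usable_ports > purchased_ports:
        raise ValueError("usable ports must be between 1 and the purchased ports")
    return purchased_ports * price_per_port / usable_ports


if __name__ == "__main__":
    # ASSUMPTION: ten 16-port switches yield 96 device ports.
    core_edge = cost_per_usable_port(160, 96, 1_000)
    # ASSUMPTION: every one of the 128 director ports is usable.
    director = cost_per_usable_port(128, 128, 2_500)
    print(f"core/edge: ${core_edge:,.2f} per usable port")
    print(f"director:  ${director:,.2f} per usable port")
```

Under these assumptions the gap narrows from $1,500 to roughly $833 per usable port.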
Reality Check
Core/edge & Directors are not mutually exclusive
• Models can & should be mixed
Some apps cannot handle fabric disruptions of any kind
Some fabrics can never ever have reduced capacity
Some apps do not have to have full B/W all the time
Fabric Design “five 9s” Factors
The larger the switch/director nodes
• The less likely there will be inter-switch/director traffic
• The more oversubscribed your fabric can be w/o increased risk
• The more important “HA” becomes in the node itself
FSPF has limited failover capabilities
• The loss of a path in the fabric (ISL failure) will cause failover
• Failover may not be fast enough to avoid SCSI device timeout
• Edge device retransmissions or failover must be designed in
The Key is determining where to implement with what & when
Use the same ROE as before
Thorough knowledge of the data & environment
• Hardware, software, systems, etc.
Match the type of SAN to the application
What you should do
Educate yourself about your data & environment
Design your SANs to meet the needs of the business
• Provide five 9s with full capability for those apps that need it
• Provide five 9s with less than full capability for those apps that don’t need it
Making your entire SAN environment completely five 9s w/no loss of capabilities could be cost prohibitive
SAN Design Methodology
[Diagram: Data Collection, Data Analysis, Arch Develop, Prototype and Test, Transition, Release to Production, Add / Change / Remove / Mgt / Troubleshoot. Phases: Design, Implementation, Maint. Also labeled: Upgrade / Architectural change]
Other tools you can use
Interactive online high availability interrogator
• Helps determine the cost of your downtime
White papers
• http://www.available.com