8/8/2019 Virtualisation frustration - NEC
http://slidepdf.com/reader/full/virtualisation-frustration-nec 1/8
Coping with SAN Storage Frustration Caused by Server Virtualization
Advanced Storage Products Group
Table of Contents
Introduction
Server Virtualization & Networked Storage Issues
NEC’s D-Series Storage
Conclusion
Introduction
Server virtualization is clearly one of the breakout technologies in the first decade of the 21st
century. The unambiguous value of virtualizing servers comes from the measurable increase
in hardware utilization and superior application deployment flexibility, which in turn facilitate
highly cost-effective server consolidation. The results are both operationally and
economically compelling; however, they are not without pitfalls: unless properly
managed, server virtualization can result in difficulties with SAN storage.
Server virtualization provides highly desirable availability capabilities such as transparent live
migration of applications without disruption, live migration of storage volumes without
disruption, near instantaneous recovery of down machines at a local or remote site, on-
demand allocation of hardware resources based on QoS policies per virtualized guest, and
non-stop availability in the event of a hardware failure. These capabilities, along with even
basic hypervisor functionality, require networked storage for both day-to-day operation and
ease of management. However, this is where the frustration begins.
Server virtualization is incredibly easy: creating virtual server guests is as simple as point,
click, and create. Networked storage for virtualized servers, however, is anything but easy
on most systems. Key challenges for storage admins include:
• Dealing with excessive storage network oversubscription;
• Effectively managing virtual server performance;
• Reducing virtual server scheduled downtime as a result of provisioning and volume
expansion;
• Weighing convoluted storage tradeoffs in reliability, availability, performance, and
cost for virtualized mission critical application data; and
• The deadly effects of silent data corruption.
These are not trivial issues and must be dealt with for each and every virtual server guest.
Server Virtualization & Networked Storage Issues
Excessive storage network oversubscription
An odd thing occurs far too frequently when IT organizations move their applications from
physical servers to virtual ones: the virtualized application performance degrades
considerably and sometimes comes to a screeching halt. This phenomenon is perplexing
and confusing, especially after a pilot that performed without a hiccup during testing.
Troubleshooting the root cause of the performance problem can be an extensive exercise in
frustration. The reason is a bit complex; after much head scratching, the problem can often
be traced to excessive oversubscription.
Oversubscription is the assignment of more potential utilization demand than resources can
possibly handle if all that potential utilization demand were to take place simultaneously. The
underlying assumption for most storage networks is that this possibility is exceedingly
remote. When this assumption is true, oversubscription makes sound financial sense
because it demonstrably increases resource utilization. Increased resource utilization means
significantly lower capital (less hardware and software) and operating (less maintenance, real
estate, power, cooling, and management) expenditures.
However, when server virtualization is implemented on a wider scale this assumption
unexpectedly becomes invalid and bad things happen when there is excessive
oversubscription. Excessive oversubscription means that it becomes probable for utilization
demand to exceed the available resources. When that happens, the resource in question
cannot respond quickly enough — or perhaps not at all — to utilization requests. (As an
analogy, a commonplace example of oversubscription is when you dial a telephone number
and receive a “fast busy” signal.) When the network oversubscription level is greater than
what the current resources can handle, the inability to service the request results in the
equivalent of a storage network “fast busy” signal.
Oversubscription is the value principle behind server virtualization’s much increased server
hardware utilization. The performance problem occurs when the oversubscribed virtualized
servers are overlaid on top of oversubscribed networked storage. Instead of being additive,
the oversubscription problem is multiplicative. Here is a simple example to more clearly
illustrate this concept:
Common best practices for SANs have the target SAN storage array ports oversubscribed on
average at an 8:1 ratio, meaning there are 8 physical server initiator ports connecting to
every SAN storage array target port over the SAN fabric. For hypothetical purposes, let’s
make those physical servers virtual servers instead, with 5 virtual guest servers each. If the
SAN storage array target port oversubscription doesn’t change, the actual oversubscription
ratio has just increased by 5x making it 40:1. This means that the odds of a given target port
being busy have also increased by 5x, and those virtual guests are not going to get the
storage performance they were expecting. And if the target ports get too busy, the SCSI
protocol is not known for being forgiving and will eventually time out. This event means the
virtual application will be told its storage is not there, and it will crash.
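The arithmetic in this example can be reproduced in a few lines. The port and guest counts below are the hypothetical values from the text, not measurements from a real fabric:

```python
# Effective oversubscription when virtual guests are layered on an already
# oversubscribed SAN fabric. Values mirror the hypothetical example above.

initiators_per_target = 8   # physical server initiator ports per array target port
guests_per_server = 5       # virtual guests per physical server

# Each guest issues I/O as if it owned the initiator port, so the ratios multiply
# rather than add.
effective_ratio = initiators_per_target * guests_per_server
print(f"Effective oversubscription: {effective_ratio}:1")  # prints "Effective oversubscription: 40:1"
```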
Figure 1: 40:1 Oversubscription Example
Another aspect of virtualized server oversubscription that tends to get overlooked, even
though it occurs far too frequently, is LUN (or volume) oversubscription. Here is how this
occurs: the server virtualization hypervisor has a storage virtualization capability that allows
it to take physical volumes assigned by the storage admin
and carve them into virtual volumes for virtual machine guests. Problems occur when
multiple virtual applications start pounding on those virtual volumes simultaneously. Each
virtual application is requesting I/O from what it believes are its own unique, dedicated disks
when in reality, all virtual applications are being serviced by the same shared disks.
This form of oversubscription can get very ugly very fast. I/O requests are queued by both
the storage system and the disks. As queues fill — which happens much faster with SATA
than SAS or Fibre Channel drives because of their much smaller queue depth — response
times degrade. When queues become full, virtual application I/O requests do not get
serviced at all (the equivalent of a “fast busy” in the telephone analogy). Once again, the
SCSI protocol will very likely time out and crash the virtual application.
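The queue-full behavior described above can be sketched with a toy model. The queue depth of 32 and the burst of 40 requests are illustrative assumptions, not drive specifications:

```python
from collections import deque

# Toy model of a drive's bounded I/O queue. A hypothetical queue depth of 32
# stands in for a real drive's limit; SATA depths are typically much smaller
# than SAS/FC, which is why SATA queues fill first.
QUEUE_DEPTH = 32

def offer_io(queue, request, max_depth=QUEUE_DEPTH):
    """Return True if the request is queued, False if it gets a 'fast busy'."""
    if len(queue) >= max_depth:
        return False  # queue full: the initiator will eventually time out
    queue.append(request)
    return True

queue = deque()
# A burst of 40 simultaneous I/O requests against a depth-32 queue:
rejected = sum(0 if offer_io(queue, i) else 1 for i in range(40))
print(f"queued={len(queue)} rejected={rejected}")  # prints "queued=32 rejected=8"
```

The rejected requests are exactly the “fast busy” signals the text describes: the initiator retries, and if the condition persists, the SCSI timeout fires.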
Too much virtual application scheduled downtime
When any storage process is application-disruptive or requires an outage for maintenance, it
has to be scheduled for a window when the impact of application downtime is minimized.
Storage provisioning and volume expansion are traditionally application-disruptive, and this
usually means scheduling such operations in the wee hours of the morning, on weekends,
or on holidays to minimize the impact on business operations.
It is easy to see why this scheduling requirement might become frustrating to virtualized
server users. Whereas creating virtual server guests is easy — point, click, create, and go
— provisioning SAN storage is typically anything but easy. Most storage admins find storage
provisioning today extremely convoluted. Provisioning is a considerable storage admin
burden and time sink that continues to expand, and users commonly have to wait days,
weeks, months, or sometimes longer, to get new storage capacity made available to them.
Why is this the case? It’s because storage provisioning tasks are many, detailed,
meticulous, difficult, and human error prone. There are tradeoffs with every decision and
each decision takes time to implement. As each hypervisor virtual application is provisioned,
it has no access to its storage and remains offline until the storage provisioning process is
complete. When there are human errors, the storage provisioning process takes longer.
The longer the provisioning takes, the greater the virtual application disruption. And because
few server admins possess storage knowledge or provisioning skills, they must rely on the
storage administrator to get their storage provisioned.
After volumes are provisioned, expanding them is also problematic. Expanding a volume
consists of taking the volume offline, adding drives, possibly restriping the RAID group, then
informing the attached operating system and application that additional capacity is now
available. While this is obviously painful for traditional physical servers, it is far more painful
for virtual servers because so many more applications are affected each time a volume is
taken offline to be scaled.
Convoluted storage tradeoffs
Tradeoffs are endemic with storage networking for virtualized servers: if you increase
reliability and availability, performance takes a hit. Optimize performance levels instead, and
your data may not be safe. Try to optimize for all three — performance, reliability, AND
availability — and cost goes through the roof.
Application performance takes a noticeable hit (as much as 50%) with common RAID arrays
during a drive rebuild. Drive rebuilds can be tolerable if the rebuild is scheduled when
application activity is low. However, drive rebuilds extend the timeframe in which there is an
increased possibility of an unrecoverable read error during the rebuild, and an unrecoverable
read error in a RAID5 drive rebuild will have the catastrophic result of permanently lost data
— a much more common problem than you might at first suspect.
The lower-cost, high-capacity SATA drives have an unrecoverable bit error rate (BER) of
10^-14, or roughly one error per 10^14 bits read. When there are five drives in the RAID5
set (a common RAID5 configuration), the chance of an unrecoverable read error during a
rebuild becomes 32%. Higher-cost SAS or Fibre Channel drives, with their
lower BER of 10^-15, reduce that probability to 3.2%. Even so, the risk of data loss with
RAID5 protection levels is most likely unacceptable for mission critical applications, which
usually means deploying RAID6 is the next option to consider.
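The quoted percentages can be reproduced from the BER figures. The paper does not state a drive capacity, so 1 TB per drive is an assumption chosen to match its 32% number, which uses the common linear estimate (bits read × BER); the exact probability is slightly lower:

```python
# Chance of hitting at least one unrecoverable read error (URE) while
# rebuilding a 5-drive RAID5 set. The 1 TB drive size is an assumption;
# the paper quotes the linear estimate bits * BER (0.32 and 0.032).

def ure_probability(bits_read, ber):
    """Exact probability of >= 1 URE over bits_read bits at bit error rate ber."""
    return 1 - (1 - ber) ** bits_read

surviving_drives = 4                       # a 5-drive RAID5 rebuild reads the 4 survivors
bits_read = surviving_drives * 1e12 * 8    # 4 x 1 TB expressed in bits (assumed size)

sata = ure_probability(bits_read, 1e-14)   # ~27% exact; ~32% by the linear estimate
sas_fc = ure_probability(bits_read, 1e-15) # ~3.1% exact; ~3.2% linear
print(f"SATA: {sata:.1%}, SAS/FC: {sas_fc:.1%}")
```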
RAID6 protects against the loss of data if there is an unrecoverable read error or other RAID
set drive failure during a drive rebuild. Regrettably, the performance of the storage array
during a second drive rebuild is often halved again, which is most likely also unacceptable
for a mission critical application.
That leaves RAID1 or RAID10 (with RAID10 being RAID1 sets striped across the disks for
better performance). RAID1 or RAID10 is the most expensive and generally the most common
RAID set for mission critical or high-IOPS applications. In the event of an unrecoverable
read error or fault, recovery is incredibly fast since there is a duplicate copy of the data on a
mirrored disk, and there is no hit to performance during the rebuild of the first disk. Sounds
good so far, but what the everyday virtualized server administrator may not know is that
those expensive, high performance RAID1 or RAID10 sets are subject to non-recoverable
data loss in the event of a second drive failure, and that the risk of a second drive failure is
only an order of magnitude lower than with RAID5. This level of risk will still be unacceptable
for many virtualized mission critical applications, leaving admins in a quandary.
These issues are again far more severe in a virtualized server environment because the total
number of virtualized mission critical applications running on each virtual server multiplies
the problem. If data is corrupted or lost, it is not one mission critical application admin who
will be calling; it will be many.
The deadly effect of silent data corruption
SATA drives have been a boon to the explosive growth of data because of their low cost,
high capacities, and reasonable duty cycles. Numerous organizations are finding SATA very
attractive in virtualized server environments for these very reasons and because of the way
hypervisors virtualize attached storage. Because hypervisor storage virtualization capability
carves physical volumes into virtual ones, the large capacity and low cost of SATA drives is
quite appealing. Many vendors are now recommending SATA for virtualized server
environments.
However, SATA nirvana is not reality: early adopters have discovered a disturbing problem
unique to SATA hard disk drives called “silent data corruption.” Silent data corruption occurs
when a read failure is not identified or resolved by the storage array. Despite high levels of
data integrity built into the system, when silent data corruption occurs, corrupt data is passed
by the storage array to the application without any notification or warning. This situation is
very disconcerting because it can lead to serious incorrect application behavior and results.
Silent data corruption can occur from misdirected writes, partial writes, and data path
corruption. This phenomenon is exacerbated by parity pollution, which is when RAID parity
is calculated using corrupt data making the original data irretrievable and the data corruption
undetectable until it is too late.
Figure 2: SATA Drive Silent Data Corruption Parity Pollution
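Parity pollution can be demonstrated with single-parity XOR arithmetic. The block values below are arbitrary integers standing in for sector contents; real arrays operate on full sectors, but the mechanism is the same:

```python
# Sketch of parity pollution in a single-parity (XOR) RAID stripe.

def xor_parity(blocks):
    """Compute the XOR parity of a list of data blocks."""
    p = 0
    for b in blocks:
        p ^= b
    return p

data = [0b1010, 0b0110, 0b1100]
parity = xor_parity(data)

# A silent error flips a bit in data[1]; the array is unaware.
data[1] ^= 0b0001

# Before parity is rewritten, the original block is still recoverable:
assert parity ^ data[0] ^ data[2] == 0b0110

# A later write to data[0] recomputes parity from the (corrupt) stripe...
data[0] = 0b0011
parity = xor_parity(data)   # parity now "agrees" with the bad data

# ...and the original data[1] is gone for good: reconstruction returns the
# corrupt value, with nothing left to flag the mismatch.
assert parity ^ data[0] ^ data[2] == data[1] != 0b0110
```

This is why the text calls the corruption undetectable "until it is too late": once parity has been recomputed over corrupt data, the redundancy itself has been polluted.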
Unfortunately, silent data corruption has received little press coverage even though it has
been cropping up more and more frequently and causing more data problems. The problem
has become large enough, however, that a number of Federal agencies have mandated that
no storage system purchase can be made unless it eliminates or at least meaningfully
protects against silent data corruption. For most organizations this means a big hit to the
storage budget: instead of purchasing low-cost, high-capacity SATA drives, they have to
implement more medium-to-high-cost (and lower capacity) SAS or FC drives.
Even more unfortunate is that SAS and FC drives are not safe from silent data corruption
either. While these drives suffer data corruption far less frequently, the impact of corruption
on the high-cost drives can be more significant because those drives are typically used for
more important data and applications.
It is easy to understand why so many organizations are frustrated with their storage and
virtualized servers and are wondering if there is a better way: thankfully, there is.
NEC’s D-Series Storage Arrays
NEC thoroughly understands the issues arising from virtualized server environments and has
developed the D-Series storage arrays to resolve all of them.
Simplified performance tuning for virtual servers
Simplifying storage-based application tuning helps alleviate the performance degradation
surprise when moving from physical to virtual servers. The D-Series is designed from the
ground up to provide the flexibility required to deal with oversubscription, through its
extraordinary dynamic pooling capability, its ability to scale up to 64 Fibre Channel ports,
and its PerformanceOptimizer software.
To reduce the impact of storage oversubscription by virtual servers, NEC turned to a positive
form of oversubscription: Thin Provisioning, which is capacity oversubscription. Thin
provisioning allows multiple volumes to be created and allocated to virtual servers even when
there is not enough physical capacity to hold the data if all of the volumes were full. Slices of
physical capacity are allocated to the volume when the data is written, and not before. That
means less downtime and management time spent provisioning storage.
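The allocate-on-write idea behind thin provisioning can be sketched as follows. The class names, pool size, and volume sizes are illustrative assumptions, not NEC's implementation:

```python
# Minimal sketch of thin provisioning: physical capacity is drawn from a
# shared pool only when a volume is actually written, not when it is created.

class ThinPool:
    def __init__(self, physical_gb):
        self.free_gb = physical_gb

    def create_volume(self, advertised_gb):
        return ThinVolume(self, advertised_gb)

class ThinVolume:
    def __init__(self, pool, advertised_gb):
        self.pool = pool
        self.advertised_gb = advertised_gb  # what the guest sees
        self.allocated_gb = 0               # what is physically backed

    def write(self, total_gb):
        """Grow physical backing to cover a written footprint of total_gb."""
        needed = total_gb - self.allocated_gb
        if needed > 0:
            if needed > self.pool.free_gb:
                raise RuntimeError("pool exhausted: add disks to the pool")
            self.pool.free_gb -= needed
            self.allocated_gb += needed

pool = ThinPool(physical_gb=100)
# Five 80 GB volumes are advertised (400 GB) against 100 GB of real capacity:
vols = [pool.create_volume(advertised_gb=80) for _ in range(5)]
vols[0].write(10)
print(pool.free_gb)  # prints 90: only written data consumes physical capacity
```

The oversubscription here is deliberate and benign so long as actual writes stay below the pool size; the dynamic pool function described next is what lets an admin grow the pool before it runs out.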
NEC’s dynamic pool function allows administrators to rapidly eliminate oversubscription to a
volume. As soon as the admin realizes virtualized application performance is degrading, he
or she can add disks online, one at a time, to the RAID group and LUN without ever taking
the storage or applications offline, and without data migration or impacting performance.
The ability to increase Fibre Channel port counts tackles target port oversubscription. When
target ports are oversubscribed, more ports and modules can be added on the fly to relieve
that type of oversubscription problem as well.
The D-Series PerformanceOptimizer software automates virtual application performance
tuning by identifying disk pool hotspots and allowing the administrator to non-disruptively
move logical disks from that hotspot to a less utilized set of disks.
Eliminate convoluted tradeoffs in storage performance, reliability, availability,
and cost for virtualized mission critical applications
As previously discussed, these tradeoffs present a prickly problem. NEC has come up with
powerful solutions that can’t help but make other vendors think: “why didn’t we think of that!”
For the vast majority of applications that require dual drive fault protection, NEC’s D-Series
SAN Storage has implemented silicon-based RAID6, which, unlike software based RAID6,
minimizes performance degradation during drive rebuilds.
And for those virtualized mission critical applications that require double fault protection and
zero tradeoffs in performance, NEC has developed two unique hardware based RAID sets:
the first is RAID-Triple Mirror (or RAID-TM) and the second is RAID3 double parity (or RAID-
3DP).
RAID-TM provides the very high speed of a RAID1 set along with the very high reliability of a
RAID6 set, accomplished by writing data simultaneously to three separate drives.
Even if there are two drive faults or unrecoverable read errors in the same mirror, the
application still has access to its data, with no degradation in performance even while the
drives are rebuilt.
Figure 3: RAID-TM
The second RAID set unique to NEC is RAID3 Double Parity. RAID3 Double Parity is pretty
much what it sounds like: it provides the very desirable striped performance of RAID3 while
adding the double parity of RAID6. This combination allows up to two disk failures or
unrecoverable read errors to occur without any loss of data and with only nominal
performance loss while drives are rebuilt.
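The RAID schemes discussed so far trade usable capacity against fault tolerance. The table below summarizes the tradeoff for an illustrative 6-drive set; the RAID-TM and RAID-3DP rows follow the descriptions in the text, and exact layouts are NEC-specific:

```python
# Rough usable-capacity fraction and guaranteed drive-fault tolerance per
# RAID scheme, for an illustrative set of n = 6 drives. RAID10's tolerance
# is the worst case (one failure per mirrored pair is survivable).

n = 6  # drives in the set (illustrative)

schemes = {
    #             usable fraction   faults survived
    "RAID5":     ((n - 1) / n,      1),
    "RAID6":     ((n - 2) / n,      2),
    "RAID10":    (1 / 2,            1),   # worst case: both halves of one pair
    "RAID-TM":   (1 / 3,            2),   # three full copies of the data
    "RAID-3DP":  ((n - 2) / n,      2),   # striping plus double parity
}

for name, (usable, faults) in schemes.items():
    print(f"{name:9s} usable={usable:.0%}  survives {faults} fault(s)")
```

The point of the two NEC schemes is visible in the numbers: RAID-TM buys double-fault tolerance with capacity (three copies), while RAID-3DP keeps RAID6-like capacity efficiency and adds striped performance.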
Eliminate the deadly effect of undetected, unreported silent data corruption
This reliability area is where the D-Series differentiates itself from all other vendors with its
“Extended Data Integrity Feature” (EDIF). EDIF is a standard feature that protects against
SATA-based misdirected writes, partial writes, data path corruption, and parity pollution.
More detailed information can be found in the white paper “Silent data corruption in SATA
arrays: a solution” (http://www.necam.com/storage/Contacts/?ItemID=145&wp=7).
EDIF is another clever NEC invention elegant in its simplicity. First the D-Series controller
calculates a Data Integrity Field (DIF) for each sector before the data is written to disk. In
essence, the DIF is providing a super-checksum that is stored on disk with the data, and the
DIF is checked on every read and/or write of every sector. This process identifies corrupted
data and enables corrupted data to be fixed.
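In outline, the check works like a per-sector checksum verified on every access. A sketch, using `zlib.crc32` as a stand-in for the array's actual guard algorithm (the real DIF is computed and checked in the controller, invisibly to the host):

```python
import zlib

SECTOR = 512  # bytes per sector (illustrative)

def write_sector(data):
    """Store a sector together with its data integrity field (DIF)."""
    assert len(data) == SECTOR
    return (data, zlib.crc32(data))   # crc32 stands in for the real guard value

def read_sector(stored):
    """Verify the DIF on read; refuse to return silently corrupted data."""
    data, dif = stored
    if zlib.crc32(data) != dif:
        raise IOError("silent data corruption detected in sector")
    return data

sector = write_sector(b"A" * SECTOR)
assert read_sector(sector) == b"A" * SECTOR   # a clean read passes the check

# A misdirected or partial write changes the data but not the stored DIF:
corrupted = (b"A" * 511 + b"B", sector[1])
try:
    read_sector(corrupted)
except IOError as err:
    print(err)   # the corruption is caught instead of reaching the application
```

The essential property is that the guard travels with the data, so corruption anywhere between computation and verification is detected rather than silently passed upward.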
Few other storage systems have an EDIF or EDIF equivalent for either physical or virtual
servers. What EDIF means is that lower cost, higher capacity SATA drives can now be used
for applications where lower performance is acceptable, but where unrecoverable data
corruption is not. EDIF makes SATA drives enterprise-capable.
Furthermore, the D-Series is one of a few arrays that implement the T10-DIF standard, which
detects and addresses silent data corruption in SAS drives. As a result, storage
administrators can rely on their storage array to keep mission critical data integrity at its
maximum.
Conclusion
There are a number of storage issues that frustrate server virtualization users, including:
1. Complicated storage based application tuning when moving from physical to virtual
servers;
2. Application disruptive storage provisioning and volume scaling;
3. Unacceptable storage performance, reliability, availability, and cost tradeoffs for
virtualized server mission critical applications;
4. The deadly effects of undetected, unreported silent data corruption.
NEC eliminates these frustrations with its highly innovative D-Series. Any organization
implementing virtualized servers or even just looking to add SAN storage would be well-
served to take a hard look at the NEC D-Series.
NEC CORPORATION OF AMERICA
© 2009 NEC Corporation of America. All rights reserved. Specifications are subject to change without notice. NEC is a registered trademark and Empowered by Innovation is a trademark of NEC Corporation. All other trademarks are the property of their respective owners. WP114-1_0209
2880 Scott Boulevard
Santa Clara, CA 95050
1 866 632-3226
1 408 844-1299
[email protected]
www.necam.com/storage