Protecting Your Data with Remote Data Replication Solutions
2015 STORAGE NETWORKING INDUSTRY ASSOCIATION EUROPE
Fausto Vaninetti
SNIA Europe Board of Directors
(Cisco)
Table of Contents
Protecting Your Data with Remote Data Replication Solutions
Achieving Data Protection
RAID and RAIN
Local, Metro, Geo
Remote Data Replication for Fibre Channel-Based Disk Arrays
Advanced TCP/IP Stack
Optimization and Efficiency in IP-based Storage Replication Solutions
More To The Game
Summary
About SNIA Europe
September 2015
Protecting Your Data with Remote Data Replication Solutions
Achieving Data Protection
No one doubts that the amount of data being generated across the world is increasing exponentially. Data generated by organizations is stored, mined, transformed and utilized continuously. Data represents a critical component of the operation and function of organizations, and consequently data protection methodologies are required to avoid disruptions in business operations. In fact, every company should consider its data the second most valuable asset after its employees and should implement some form of data protection.
This paper examines some of the more common and effective data protection schemes in use today, offering a concise and simple-to-understand point of view. Remote data replication solutions are also covered in some technical detail.
RAID and RAIN
The first approach to data protection is typically the adoption of a disk array with an embedded mechanism known as Redundant Array of Independent Disks (RAID), a term dating back to 1988. In short, this is a data virtualization technology that combines multiple disk drives into a logical group for the purpose of data protection (and performance improvement as well). Data is distributed across the set of drives according to the desired RAID level schema, and a specific balance is achieved among reliability, performance and capacity.
RAID is categorized according to levels. The Common RAID Disk Data Format specification by SNIA defines a standard data structure describing how data is formatted across the disks in a RAID group for every RAID level. The primary levels include RAID 0 (striping without redundancy), RAID 1 (mirroring), RAID 5 (striping with single distributed parity), RAID 6 (striping with double distributed parity) and RAID 10 (mirrored stripes).
With RAID levels higher than 0, damage to individual disk sectors or the failure of one or more hard disks can be tolerated while still preserving data integrity. Data is not actually copied, but rather complemented with an amount of redundancy so that the original data can be reconstructed via appropriate mathematical algorithms even if a limited portion of it becomes unavailable due to failure. In order to improve performance without sacrificing fault tolerance, the use of a fast cache to front-end the RAID group has become the norm, both on servers and within disk arrays. This explains why RAID 5 and RAID 6 have become very popular implementations. It is also worth mentioning that solid state disks, rather than magnetic disks, are the new trend and, according to many analysts, represent the single biggest revolution in the storage industry in a long time.
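The reconstruction mechanism mentioned above can be illustrated with single-parity (RAID 5-style) protection, where the parity block is simply the bitwise XOR of the data blocks in a stripe. A minimal sketch in Python; the block contents are purely illustrative:

```python
# RAID 5-style parity: the parity block is the XOR of the data blocks,
# so any single lost block can be rebuilt from the surviving ones.
from functools import reduce

def xor_blocks(blocks):
    """XOR same-sized byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, chunk) for chunk in zip(*blocks))

data = [b"AAAA", b"BBBB", b"CCCC"]   # data blocks in one stripe
parity = xor_blocks(data)            # stored on the parity drive

# Simulate losing the second drive and rebuilding its block
# from the two surviving data blocks plus the parity block:
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
print(rebuilt == data[1])            # True
```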
A variation of the RAID approach has recently gained industry attention. This time, multiple copies of data are spread across multiple computational nodes. To maintain naming similarity with the previously described mechanism, this approach has been referred to as Redundant Array of Independent Nodes (RAIN), and it forms the basis of data protection within multiple commercially available implementations of big data Hadoop clusters and hyper-converged systems.
All of the above solutions are in wide use by virtually all organizations and do provide effective local data protection. Nevertheless, in order to protect against local site disasters, a copy of all data should also be stored in a properly identified alternative location. Whether it is your secondary datacenter or a third-party managed datacenter, remote data replication across a distance is the way to go.
Some organizations are also keen to keep copies of data on different physical media, in an effort to further minimize the chances of concurrent disruption related to the technology itself. When a time lapse of 24 hours between production data and its copy is acceptable, tape backups can also be used as another option for data protection. Tapes can store huge quantities of data at a fraction of the cost of disk arrays, consume negligible power and are compatible with the strictest security standards, which require that tape cartridges be stored inside underground bank vaults for long-term retention. Organizations tend to use tape backups as a complement to disk-based remote data protection solutions.
Local, Metro, Geo
The essence of data protection is to securely store multiple copies of data onto independent physical media. Doing this within a single datacenter is clearly a local solution. If something goes wrong with the facility (flood, fire, hurricane, power blackout, sabotage), data can be inaccessible or even completely lost. Having a copy of data in another location removes the criticality of a single-site disaster. Interest in data recovery solutions is well demonstrated by surveys of CIOs and further underlined by recent forecasts that indicate Disaster Recovery as a Service (DRaaS) as one of the fastest growing segments of the cloud business.
The secondary site has to be carefully chosen, outside the so-called "threat radius", so that the chances of any failure affecting both datacenters at the same time are negligible. As a result, distances above 300 km are the norm when looking for true protection from natural calamities or sudden and unforeseen major system failures.
Organizations with even higher requirements for data availability and uptime have now adopted the three-site approach, whereby twin datacenters are deployed within a short distance of each other and both of them are active at the same time to achieve business continuity. The third site is far away and is used for simple data recovery needs or true disaster recovery purposes. In this situation, failure of one of the twin datacenters will not prevent the business from remaining up and running. Applications will always be on, and no downtime will be required to recover them after the failure. Technically, this can be expressed as a Recovery Time Objective (RTO) equal to zero.
Within the twin datacenters, data is kept in sync and can be assumed to be identical in both sites. In fact, every write needs to be acknowledged by both storage arrays before being considered complete. This imposes a practical restriction on the maximum distance between the two locations, typically in the range of about 100 km. Longer distances would drive application performance down to unacceptable levels.
From the point of view of the synchronous replication software in use, it is actually better to consider an upper limit on round-trip latency rather than on distance. To some degree this depends on the vendor of choice, but a valid rule of thumb calls for 2 msec as the limit for these kinds of metro implementations. As a matter of fact, when round-trip latency exceeds 8 msec, the deployment is clearly a geographical implementation and data replication is achieved asynchronously: writes are considered complete when acknowledged by the local storage array alone, and data is then transferred to the remote disk array with a small delay. In other words, data in the two locations is slightly different and the copy lags the source. The acceptable temporal difference between them is called the Recovery Point Objective (RPO).
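The distance figures above follow directly from signal propagation: light travels through optical fiber at roughly 200,000 km/s, i.e. about 5 µs per kilometer one way. A rough sketch, using that common approximation and ignoring equipment and queuing delays:

```python
# Estimate round-trip latency from site separation over optical fiber.
# Assumes ~5 microseconds per km one way (speed of light in fiber),
# ignoring switch, router and transceiver delays.

US_PER_KM_ONE_WAY = 5.0

def fiber_rtt_ms(distance_km: float) -> float:
    """Round-trip propagation delay in milliseconds."""
    return 2 * distance_km * US_PER_KM_ONE_WAY / 1000.0

for km in (100, 300, 1000):
    mode = "synchronous" if fiber_rtt_ms(km) <= 2.0 else "asynchronous"
    print(f"{km:>5} km -> {fiber_rtt_ms(km):.1f} ms RTT ({mode})")
```

At 100 km the round trip is about 1 ms, comfortably within the 2 msec rule of thumb; at 300 km and beyond, only asynchronous replication remains practical.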
For both cost and technical reasons, nearly all geographical solutions rely upon the Internet Protocol (IP) for transport across the Wide Area Network (WAN). This is the true realm of disaster recovery, where both RPO and RTO values are above zero and, if disaster strikes, manual intervention is required to transition activity to the secondary site, with expected downtime while applications recover. There are many factors involved in choosing the correct remote data replication solution for any specific business, such as the amount of data that can be lost, the time taken to recover and the distance between sites.
Remote Data Replication for Fibre Channel-Based Disk Arrays
Fibre Channel (FC) has been the technology of choice for storage connectivity since the inception of storage networks. Even today, despite being well past the hype cycle and no longer prominent in the press, it still dominates over alternatives in terms of adoption for shared external disk arrays.
For Fibre Channel-based disk arrays, two main alternative approaches for remote data replication using IP are currently available.
The first one leverages dedicated IP replication ports on the disk array itself, whereby servers access their local data via FC fabrics but the remote connection between peer disk arrays goes straight to the IP network. Clearly, this method implies the availability of a sufficient number of native IP ports on the disk array, and this condition is not always met. The second option makes use of a multi-service appliance that not only provides local FC switching capabilities but also enables FC encapsulation within IP packets for optimized transmission through the Wide Area Network (WAN). A variation of this approach sees the same functionality hosted on a specific line-card within a highly available FC modular switch, known as a director.
In most, but not all, cases data transmission is unidirectional, from the production datacenter toward the disaster recovery site. Twin datacenters with active/active operation, or occasional data recovery situations, may require data to flow in the opposite direction as well.
Companies should carefully evaluate the range of technical solutions on the market by comparing them against decision criteria that include performance, security, flexibility, reliability, diagnostic tools and price. Price should not be the main decision factor, since a consistent disaster recovery project requires an overall level of investment that far exceeds the price of the data replication solution alone, whatever it may be. For large organizations with large storage environments, the adoption of IP replication ports on disk arrays may not be optimal. The number of IP replication ports that can be used concurrently on disk arrays is limited, and in any case lower than the number of 16G FC ports connected toward the production fabrics. This can potentially create a bottleneck, since the aggregate throughput of the 16G FC ports on most arrays far exceeds the capability of their native 10G IP counterparts.
For their flexibility and performance, as well as the capability to use a single remote data replication solution for multiple disk arrays from different vendors, Fibre Channel over IP (FCIP) encapsulation engines are the most widely used implementation to extend a Storage Area Network (SAN) across geographically separated datacenters. Moreover, FCIP is not limited to remote data replication: it supports other applications, such as centralized SAN backup and data migration over very long distances. As distances grow, it becomes impractical, or very costly, to rely upon native Fibre Channel connections, possibly transported over optical transmission equipment.
FCIP tunnels, built on a physical connection between two SAN extension switches or blades, allow Fibre Channel frames to pass through the existing IP WAN. The TCP connections ensure in-order delivery of Fibre Channel frames and lossless transmission.
The Fibre Channel fabric and all Fibre Channel targets and initiators are unaware of the presence of the IP WAN. The TCP/IP stack ensures that all data lost in flight is retransmitted and placed back in order before being delivered to upper-layer protocols. This is an essential feature to prevent SCSI timeouts for open systems-based replication. This stack is also capable of automatically and quickly adjusting the traffic rate on the WAN connection between user-defined minimum and maximum bandwidth values. In other words, a feedback mechanism ensures that the quality of the long-distance IP link dynamically affects the FCIP transmission rate, permitting optimal throughput for all flows. Evidently, the user-defined minimum bandwidth value should be carefully chosen so that it does not exceed the bandwidth available on the WAN link.
As a best practice, this minimum bandwidth should be available at all times, because the need for replication may arise at any time. This can be achieved either by specifically reserving bandwidth for FCIP or by having available bandwidth that far exceeds the current needs of all uses. Furthermore, whenever possible, adopt a reliable IP connection that drops very few packets, since the performance of FCIP, like that of any high-performance TCP connection, greatly depends on a low retransmission rate.
An enterprise-class remote data replication solution should excel in performance (achieved throughput, tolerated latency, packet drop handling), monitoring (port and flow visibility and statistics) and diagnostic capabilities (ping, traceroute, logging). The group of advanced features that are its main constituents starts with a sophisticated TCP/IP stack.
Advanced TCP/IP Stack
Although software implementations on top of a general-purpose processor are possible, the performance and reliability levels that disaster recovery projects impose are considerable. For that reason, most solutions use hardware-assisted implementations, where custom ASICs sustain the most demanding computational tasks such as compression and encryption.
A valid remote data replication solution should be able to operate in both asynchronous and synchronous mode. In the first case, the most typical one, distances up to 10,000 km should be supported to address the needs of multinational companies with datacenters on multiple continents. For synchronous replication, latency is a gating factor and extra care is required to minimize it. Not all solutions are equal in this respect.
Data needs to be encapsulated before transmission over a long-distance IP network. In general, the efficiency of the chosen transport method depends on its capability to reduce overhead by filling datagrams to the supported Maximum Transmission Unit (MTU), maximizing payload per unit of overhead.
The best approach is to use "frame batching", so that a stream of data frames (typically 4 or 64 of them) is worked on at the same time, compressed and fit into the available MTU size. When a single data frame is compressed and mapped to an Ethernet frame, wasted payload bytes cause inefficiency and, consequently, higher overhead for the same traffic. In this case, the bigger the MTU the better: jumbo frames up to 9000 bytes are preferred to the standard MTU of 1500 bytes when best performance is desired. Since a Fibre Channel frame can be up to 2148 bytes, and considering some margin for additional headers, an MTU size of 2300 bytes is the minimum recommended value to use.
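The effect of MTU size and batching on encapsulation efficiency can be estimated with a simple model. The header sizes below are simplified assumptions (20-byte IP plus 20-byte TCP per datagram, ignoring Ethernet and FCIP encapsulation overhead), so the numbers are indicative only:

```python
# Rough payload efficiency of encapsulating Fibre Channel frames into
# TCP/IP datagrams at different MTU sizes.
import math

FC_FRAME = 2148      # maximum Fibre Channel frame size in bytes
HDRS = 40            # assumed IP + TCP header bytes per datagram

def efficiency(mtu: int, batch: int = 1) -> float:
    """Payload bytes delivered per wire byte for `batch` FC frames."""
    payload = batch * FC_FRAME
    per_dgram = mtu - HDRS                   # payload room per datagram
    datagrams = math.ceil(payload / per_dgram)
    return payload / (payload + datagrams * HDRS)

print(f"MTU 1500, single frame : {efficiency(1500):.1%}")
print(f"MTU 2300, single frame : {efficiency(2300):.1%}")
print(f"MTU 9000, batch of 4   : {efficiency(9000, batch=4):.1%}")
```

Raising the MTU from 1500 to 2300 bytes lets a full FC frame fit into one datagram instead of two, and batching several frames into a jumbo frame reduces the per-frame overhead further still.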
Some implementations also incorporate a way to determine the smallest MTU along the path to the remote target, a feature known as Path MTU Discovery (PMTUD). PMTUD is described in RFC 1191 and works well with pure L3 networks. It is worth mentioning that the TCP Maximum Segment Size (MSS) is slightly smaller than the MTU, in order to accommodate the TCP and IP headers.
In the end, data frames need to go through the TCP/IP stack, and here is where some solutions may fall short of expectations due to technical trade-offs. On one side, a long-distance IP network poses challenges in terms of available bandwidth, available paths and packet drops. On the other side, data replication dislikes instability and variability and would prefer guaranteed bandwidth with no packet drops. Distance, and therefore latency, has a negative effect on throughput. Put simply, with standard TCP/IP, information transfer suffers the farther you go. This is because of the flow control mechanism that is part of the TCP protocol: link latency, and the wait for acknowledgment of each set of packets sent, prevent long, fat pipes from being efficiently utilized.
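The throughput penalty can be quantified: with a fixed transmission window, a sender can deliver at most one window of data per round trip. A sketch using the classic 64 KB TCP window, the maximum available without window scaling:

```python
# Why latency caps standard TCP throughput: with a fixed window, at most
# one window of data can be delivered per round trip, regardless of
# how fast the link itself is.

def max_throughput_MBps(window_bytes: int, rtt_ms: float) -> float:
    """Upper bound on throughput in MB/s for a given window and RTT."""
    return window_bytes / (rtt_ms / 1e3) / 1e6

# Classic 64 KB TCP window (no window scaling):
for rtt in (1, 10, 100):
    print(f"RTT {rtt:>3} ms -> {max_throughput_MBps(65535, rtt):8.2f} MB/s max")
```

At 100 msec of round-trip latency, a 64 KB window caps a link of any speed below 1 MB/s, which is exactly why WAN-optimized stacks rely on much larger windows.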
Optimization and Efficiency in IP-based Storage Replication Solutions
For these reasons, an efficient remote data replication solution cannot do without a purpose-built, WAN-optimized TCP/IP stack. Thanks to that, it becomes possible to achieve wire-rate transmission on high-speed links, with application throughput up to 1250 MBytes/s on 10 Gbps ports. It is also possible to overcome 100+ msec of latency on the WAN and tolerate excessive jitter, bit errors and a loss of 1 out of 2000 transmitted packets. Experience has shown that general-purpose WAN optimization devices cannot provide better performance than purpose-built remote data replication solutions; rather, they introduce complexity, another point of failure, and another asset to configure, manage, monitor and troubleshoot. Moreover, being general-purpose, they have no specific storage protocol awareness (FC, SCSI) and consequently fail to add real value to the solution.
Transmitting data over an IP network avoids the constraints and distance limitations suffered by native Fibre Channel links, whereby a buffer-to-buffer credit mechanism is used to make sure frames are not lost due to congestion between source and target. The burden of properly handling congestion, and flow control in general, is offloaded to the TCP layer and its native capabilities.
One of them is the transmission window size, dynamically adjustable in response to WAN conditions. In TCP, the amount of outstanding unacknowledged data needed to fully utilize a WAN connection is tightly associated with the Bandwidth-Delay Product (BDP), derived by multiplying link bandwidth by link round-trip time. A solid remote data replication solution will support a large BDP value, even in excess of 120 MB, and avoid any drooping effect over long, fat pipes.
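A back-of-the-envelope calculation shows where a figure on the order of 120 MB comes from; the link parameters below are illustrative:

```python
# Bandwidth-Delay Product: the amount of data that must be in flight
# (sent but not yet acknowledged) to keep a link fully utilized.

def bdp_bytes(bandwidth_gbps: float, rtt_ms: float) -> float:
    """BDP = bandwidth * round-trip time, converted to bytes."""
    return bandwidth_gbps * 1e9 / 8 * rtt_ms / 1e3

# A 10 Gbps WAN link with 100 ms of round-trip latency:
print(f"{bdp_bytes(10, 100) / 1e6:.0f} MB")   # 125 MB in flight
```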
Another optimization that comes in handy when more efficiency over the WAN is required is known as Selective Acknowledgment (SACK), described in RFC 2018. Although TCP has been a very robust and adaptable protocol since the very beginning, it has gone through several iterations to enhance its ability to perform in environments combining high latency with high bandwidth. The goal is to minimize TCP control traffic and allow the protocol to recover faster from dropped frames.
Standard TCP implements reliability by sending a cumulative acknowledgment for received data segments that are complete and in sequence. In case of packet loss, subsequent segments will not be acknowledged by the receiver, and the sender will retransmit all segments after the loss is detected. This behavior is quite inefficient, since it leads to retransmission of segments that were actually received successfully and provokes a sharp reduction in the congestion window size, so that subsequent transmissions happen at a slower rate than before. By using the SACK mechanism, a receiver is able to selectively acknowledge segments received after a packet loss. The sender then has the capability to retransmit only the lost segments and fill the holes in the data stream.
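The difference can be sketched with a toy model of the two recovery strategies (segment numbers are illustrative, and real TCP retransmission behavior is more nuanced than this):

```python
# Retransmission after a single lost segment: cumulative acknowledgment
# (go-back-N style) vs. Selective Acknowledgment (SACK).

def retransmit_cumulative(sent, lost):
    """Without SACK the sender resends everything from the first loss on."""
    first_loss = min(lost)
    return [s for s in sent if s >= first_loss]

def retransmit_sack(sent, lost):
    """With SACK the receiver reports the holes, so only those are resent."""
    return sorted(lost)

segments = list(range(1, 11))   # segments 1..10 in flight
dropped = {4}                   # segment 4 lost on the WAN

print(retransmit_cumulative(segments, dropped))  # [4, 5, 6, 7, 8, 9, 10]
print(retransmit_sack(segments, dropped))        # [4]
```

A single drop costs seven retransmitted segments under cumulative acknowledgment but only one under SACK, and the gap widens with larger windows.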
More often than not, organizations try to save money on the WAN connectivity service by enabling compression before sending data across. The simple idea is to transmit the same amount of data over a lower-bandwidth link. The compression engines are typically based on the well-known "deflate" algorithm described in RFC 1951, even if derivative implementations provide different trade-offs of throughput vs. compression ratio. The achieved results are very dependent on the data to be compressed, but a good implementation is normally capable of a 4:1 compression ratio for real data (not test data).
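The data dependence of deflate is easy to demonstrate with Python's zlib module, which implements RFC 1951; the sample payloads below are purely illustrative:

```python
# Deflate (RFC 1951) via zlib: the achievable compression ratio depends
# entirely on the data being compressed.
import os
import zlib

def ratio(data: bytes) -> float:
    """Original size divided by compressed size."""
    return len(data) / len(zlib.compress(data, level=6))

text = b"WRITE block 0042 to LUN 7; " * 400   # repetitive, compresses well
noise = os.urandom(len(text))                  # random data barely compresses

print(f"repetitive payload : {ratio(text):5.1f}:1")
print(f"random payload     : {ratio(noise):5.1f}:1")
```

Highly repetitive payloads compress far beyond 4:1, while already-compressed or encrypted data may even grow slightly, which is why quoted ratios should always be validated against real production data.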
Last but not least, with the ever-increasing amount of data generated across the globe, there is also a clear trend toward higher-speed remote replication solutions. Up to a couple of years ago, Gigabit Ethernet speeds were adequate for most companies; nowadays the sweet spot is certainly 10 Gigabit per second, with 40 Gigabit Ethernet looming as the next candidate for market adoption.
The TCP/IP implementation on enterprise-class remote data replication solutions is clearly optimized for carrying storage traffic. It can accommodate long, fat pipes, avoid the low-throughput slow-start behavior of normal TCP implementations and recover more quickly from packet loss, as described in several documents including RFC 1323 (window scaling) and RFC 5681 (slow start, congestion avoidance, fast retransmit and fast recovery). It also employs variable, per-flow traffic shaping that yields high instantaneous throughput while minimizing the possibility of overruns on downstream routers.
More To The Game
Security can be an added feature of the chosen implementation. By using 256-bit keys and hardware-assisted encryption engines compliant with the Advanced Encryption Standard (AES), high performance can be achieved despite the complexity of the algorithms. Various situations determine where it is best to apply encryption for data in flight: for example, if only disk-to-disk replication traffic needs this level of security, it can be advantageous to enable it on the dedicated remote data replication solution. If other traffic needs to be encrypted between the two datacenters, it is preferable to enable encryption on the datacenter exit routers, where wire-speed encrypted traffic on 100G ports is now possible. Alternative implementations on hosts, dedicated security engines or DWDM muxponders are also available, but they do not offer the same benefits in real-world deployments and are confined to more specific use cases.
Ideally, the same remote data replication solution will be capable of both open systems logical disk replication and mainframe volume replication, providing a consistent and homogeneous answer to both FC and FICON replication needs. This capability, sometimes referred to as multimodality or FC/FICON intermix, helps justify the investment in high-performance extension technologies, since it can now be leveraged across the enterprise to include mainframe volume replication and tape vaulting in addition to a variety of open systems disk replication solutions and tape libraries.
Large-scale storage deployments often require support for multimodality (disk, tape, open systems, mainframe), heterogeneous arrays, large bandwidth, high throughput, nonstop operations, tools for administration and configuration, and robust diagnostics. Some leading SAN extension solutions can accommodate all of these requirements and allow them to be managed by different administrator groups within an enterprise, using INCITS T11 Virtual Fabrics (VF) technology for logical partitioning and Role Based Access Control (RBAC) for user profiling and privilege assignment. This brings welcome multi-tenancy to storage area networks.
For high availability, it is also recommended to architect the overall solution in such a way that replication traffic can continue to operate during firmware upgrades and through single replication port or device failures. That is why link aggregation groups are configured and where equipment redundancy comes into play.
The remote data replication network can be incorporated into production FC fabrics or kept separate. Separation can be achieved logically or physically, using INCITS T11 Virtual Fabrics (VF) technology or dedicated devices. When physical separation is desired, the disk array will host onboard dedicated FC ports for replicas, connected to the SAN extension network. In small environments, however, the disk array will have a limited number of FC ports, all of them shared between production and replication traffic. In this case, the SAN extension appliance will need to provide specific functionality to avoid merging the SAN in the primary datacenter with the one in the secondary, so that issues at the remote site, or even on the WAN, will not negatively affect production traffic.
Now that migration from IPv4 to IPv6 addressing is underway in many datacenters, IPv6 compatibility is also a very reasonable requirement for any modern remote data replication solution. Where strong asymmetry in scale (and budget) between the two datacenters exists, there can also be a need to support mismatched speeds on the WAN ports at the two ends of the replication link, so that 10G is used in one location and 1G in the other. IP sub-interfaces and VLAN tagging are extra features that are sometimes required for a properly architected solution. Cloud providers and hosted managed service providers tend to make use of these capabilities when offering storage private clouds to their customers.
Most replication protocols today support unsolicited writes and thus require a single round trip to write data to a remote disk array. Where they do not, multiservice FCIP engines can provide ad-hoc acceleration capabilities to compensate. The industry has thus developed a wide range of specialized acceleration solutions, falling under the names of Write Acceleration, Read Acceleration, Tape Acceleration, Input/Output Acceleration and the like.
Summary
Organizations looking for a remote data protection solution across geographically separated datacenters can nowadays choose among a variety of options, including Fibre Channel-based disk array IP replication and multiservice appliances. Enterprise-class features and a low Total Cost of Ownership (TCO) represent valid decision criteria, just like multimodality and multi-tenancy. The ability to integrate into any IP network without special tuning considerations is enabled by an optimized TCP/IP stack and the resulting capability to handle glitches over the WAN. Ease of configuration and comprehensive management tools help provide insight and end-to-end visibility for proper performance assessment and troubleshooting. FCIP has emerged over alternative protocols, and for many years purpose-built FCIP devices have represented the preferred solution for remote data replication, especially for medium and large companies. Thanks to this technology, it is now possible to alleviate the distance barrier and achieve near-local replication performance, securely, over long distances.
About SNIA Europe
SNIA Europe advances the interests of the storage industry by empowering organizations to
translate data and information into business value by promoting the adoption of enabling
technologies and standards. As a Regional Affiliate of SNIA Worldwide, we represent storage
product and solutions manufacturers and the channel community across EMEA. For more
information, visit http://www.snia-europe.org/.