OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life -...
DESCRIPTION
In this talk I will present an overview of what disaster recovery is, its main organizational and technical aspects, and how we solved the problem of DR for many companies using a combination of OpenNebula, MooseFS and lots of duct tape. In detail, the presentation will show how to realistically estimate your recovery time objective (RTO) and other essential parameters such as the recovery point objective (RPO), depending on your company structure and requirements; how to create a reliable, self-managing infrastructure using OpenNebula and the MooseFS/LizardFS distributed filesystems; how to efficiently perform geographic disaster recovery (including differential and deduplicating snapshots); and the additions and changes we made to OpenNebula to help with remote management and support. Some specific real-life disasters will be examined, along with some hardware tools designed to help, like the portable cloud or the bomb-proof server rack.
TRANSCRIPT
![Page 1: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/1.jpg)
Disaster recovery with OpenNebula
Carlo Daffara
![Page 2: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/2.jpg)
First, let me get some coffee.
![Page 3: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/3.jpg)
![Page 4: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/4.jpg)
![Page 5: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/5.jpg)
![Page 6: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/6.jpg)
“Disaster recovery (DR) involves a set of policies and procedures to enable the recovery or continuation of vital technology infrastructure and systems following a natural or human-induced disaster. Disaster recovery focuses on the IT or technology systems supporting critical business functions, as opposed to business continuity, which involves keeping all essential aspects of a business functioning despite significant disruptive events. Disaster recovery is therefore a subset of business continuity.”
![Page 7: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/7.jpg)
80% of businesses affected by a major incident either never re-open or close within 18 months (Source: Axa)
![Page 8: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/8.jpg)
From “Understanding the Cost of Data Center Downtime: An Analysis of the Financial Impact on Infrastructure Vulnerability”, Ponemon Research
![Page 9: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/9.jpg)
“Let’s begin with one very interesting fact. According to a survey completed in 2010, human error is responsible for 40% of all data loss, as compared to just 29% for hardware or system failures. An earlier IBM study determined data loss due to human error was as high as 80%” (From: “Business continuity and disaster recovery planning for IT professionals”, Elsevier press, 2014)
![Page 10: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/10.jpg)
![Page 11: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/11.jpg)
![Page 12: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/12.jpg)
![Page 13: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/13.jpg)
The recovery time objective (RTO) is the targeted duration of time and a service level within which a business process must be restored after a disaster (or disruption) in order to avoid unacceptable consequences associated with a break in business continuity.
The recovery point objective (RPO) is the maximum tolerable period in which data might be lost from an IT service due to a major incident.
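As a rough illustration of how the RPO translates into numbers, the sketch below estimates the worst-case data-loss window for snapshot-based replication. It assumes (this is not stated in the slides) that the worst case equals the snapshot interval plus the time needed to transfer one snapshot to the DR site. The function names and the figures are hypothetical.

```python
from datetime import timedelta

def worst_case_data_loss(snapshot_interval: timedelta,
                         transfer_time: timedelta) -> timedelta:
    """Upper bound on lost data, assuming the disaster strikes just before
    the newest snapshot finishes transferring (a simplifying assumption)."""
    return snapshot_interval + transfer_time

def meets_rpo(snapshot_interval: timedelta,
              transfer_time: timedelta,
              rpo: timedelta) -> bool:
    """True if the worst-case loss window fits inside the RPO."""
    return worst_case_data_loss(snapshot_interval, transfer_time) <= rpo

# Hypothetical figures: a snapshot every 12 hours, 2 hours to replicate it.
loss = worst_case_data_loss(timedelta(hours=12), timedelta(hours=2))
print(loss)  # 14:00:00
```

With these made-up figures a 24-hour RPO would be met, while a 12-hour RPO would not.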
![Page 14: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/14.jpg)
“Alternative storage-based replication solutions cost a minimum of $10,000 per terabyte of data covered plus ongoing maintenance. For the composite organization’s 225 protected VMs with an average size of 100 gigabytes (GB), the three year costs for licenses and maintenance are estimated at $328,500” (Forrester Research, “The Total Economic Impact of VMware vCenter Site Recovery Manager”, 2013)
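The figures in the quote can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; treating 1 TB as 1000 GB, and reading the remainder as the maintenance share, are assumptions of this example, not statements from the report.

```python
VMS = 225                   # protected VMs, from the quote
AVG_VM_SIZE_GB = 100        # average VM size, from the quote
COST_PER_TB = 10_000        # minimum cost per terabyte, from the quote
THREE_YEAR_TOTAL = 328_500  # three-year licences + maintenance, from the quote

protected_tb = VMS * AVG_VM_SIZE_GB / 1000       # assumption: decimal terabytes
licence_floor = protected_tb * COST_PER_TB       # minimum cost for the covered data
remainder = THREE_YEAR_TOTAL - licence_floor     # presumably ongoing maintenance

print(protected_tb, licence_floor, remainder)  # 22.5 225000.0 103500.0
```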
![Page 15: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/15.jpg)
3 simple rules to make a working DR:
![Page 16: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/16.jpg)
Rule 1: never put all eggs in one basket (be it hardware, software, cloud)
![Page 17: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/17.jpg)
![Page 18: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/18.jpg)
Customer buys full DR and snapshot capability from local data center; data center updates SAN firmware and loses everything. Customer discovers that snapshots and backups were kept in the same SAN with everything else.
![Page 19: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/19.jpg)
![Page 20: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/20.jpg)
In electronics, an opto-isolator, also called an optocoupler, photocoupler, or optical isolator, is a component that transfers electrical signals between two isolated circuits by using light. Opto-isolators prevent high voltages from affecting the system receiving the signal.
![Page 21: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/21.jpg)
![Page 22: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/22.jpg)
Rule 2: RTO and RPO are usually different from VM to VM
![Page 23: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/23.jpg)
![Page 24: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/24.jpg)
![Page 25: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/25.jpg)
Needs to be replicated constantly
No one cares if this dies
![Page 26: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/26.jpg)
![Page 27: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/27.jpg)
![Page 28: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/28.jpg)
Rule 3: design a reliable oracle
![Page 29: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/29.jpg)
![Page 30: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/30.jpg)
![Page 31: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/31.jpg)
Oracle of Delphi
![Page 32: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/32.jpg)
How the others do it:
![Page 33: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/33.jpg)
![Page 34: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/34.jpg)
![Page 35: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/35.jpg)
How we do it:
![Page 36: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/36.jpg)
![Page 37: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/37.jpg)
Our approach takes advantage of three individual factors:
● LizardFS’ thinly-provisioned snapshots
● online replication of chunks & tiering
● OpenNebula’s datastores
![Page 38: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/38.jpg)
![Page 39: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/39.jpg)
![Page 40: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/40.jpg)
# An example of configuration of goals. It contains the default values.
1 1 : _
2 2 : _ _
3 3 : _ _ _
4 4 : _ _ _ _
5 5 : _ _ _ _ _
# (...)
20 20 : _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
# But you don't have to specify all of them -- defaults will be assumed.
# You can define your own custom goals using labels if you use them, e.g.:
# 14 min_two_locations: _ locationA locationB # one copy in A, one in B, third anywhere
# 15 fast_access : ssd _ _ # one copy on ssd, two additional on any drives
# 16 two_manufacturers: WD HT # one on WD disk, one on HT disk
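To make the goal format easier to read, here is a minimal sketch of a parser for lines of the form `id name : label label ...`. It is a simplified illustration written for this transcript, not LizardFS' own parser, and it infers the format only from the sample above; `parse_goals` and `SAMPLE` are hypothetical names.

```python
def parse_goals(text: str) -> dict:
    """Map goal id -> (goal name, list of chunkserver labels)."""
    goals = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments and blank lines
        if not line:
            continue
        head, _, labels = line.partition(":")
        goal_id, name = head.split()
        goals[int(goal_id)] = (name, labels.split())
    return goals

# Custom goals taken from the commented examples above, uncommented here.
SAMPLE = """
2 2 : _ _
14 min_two_locations: _ locationA locationB  # one copy in A, one in B, third anywhere
15 fast_access : ssd _ _
"""

goals = parse_goals(SAMPLE)
name, labels = goals[14]
# The number of labels is the number of copies; "_" means any chunkserver.
print(name, len(labels), labels)
```

A goal such as `min_two_locations` is what lets a datastore keep one copy of every chunk at each site.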
![Page 41: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/41.jpg)
● Most disasters are “local”, for example a fire in the server room or a flood
● Two different DR sites, one near (e.g. next building/other side of the building) and one far (external datacenter)
● near DR receives a copy of the chunks that are part of the marked datastores
![Page 42: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/42.jpg)
![Page 43: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/43.jpg)
● Remote snapshots are handled in the same way: we take a full snapshot of the datastore, and differentially replicate it
● We use the “snapshot of snapshot” approach to avoid the cost of deduplication
● This way we can prioritize sync queues, and at the receiving end we get a complete, decoupled and working OpenNebula
For example, average dedup cost for ZFS: 5 to 30 GB of dedup table data for every TB of pool data, assuming an average block size of 64K.
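The lower end of that range can be reproduced with a short calculation. The sketch assumes roughly 320 bytes per dedup table entry, a commonly cited figure for ZFS that does not appear on the slide, and binary units (TiB, KiB, GiB).

```python
POOL_BYTES = 2**40          # 1 TiB of unique pool data
BLOCK_BYTES = 64 * 2**10    # 64K average block size, as on the slide
DDT_ENTRY_BYTES = 320       # assumption: commonly cited size of one table entry

blocks = POOL_BYTES // BLOCK_BYTES              # one table entry per unique block
ddt_gib = blocks * DDT_ENTRY_BYTES / 2**30      # table size in GiB

print(blocks, ddt_gib)  # 16777216 5.0
```

Under these assumptions the table needs about 5 GB per TB of pool data, which matches the lower bound quoted above.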
![Page 44: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/44.jpg)
Local (VM changes only in snapshots):
/var/lib/one/datastore ↓ DRSNAP12H
/var/lib/one/snapshots ↓ <yyyymmddhh> ↓ DRSNAP12H
inplace rsync (25x speedup)
Remote (no chunk changes in snapshots):
/var/lib/one/datastore ↓ DRSNAP12H
/var/lib/one/snapshots ↓ <yyyymmddhh> ↓ DRSNAP12H
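A minimal sketch of the naming scheme and the transfer step in the diagram above. The directory layout is read off the diagram; the remote host name and every rsync option other than `--inplace` are assumptions of this example, and the command is only built, not executed.

```python
from datetime import datetime
from pathlib import PurePosixPath

SNAP_ROOT = PurePosixPath("/var/lib/one/snapshots")

def snapshot_dir(when: datetime, datastore: str = "DRSNAP12H") -> PurePosixPath:
    """Snapshot path following the <yyyymmddhh> naming scheme."""
    return SNAP_ROOT / when.strftime("%Y%m%d%H") / datastore

def rsync_command(src: PurePosixPath, remote: str) -> list:
    """Build (but do not run) an in-place rsync of one snapshot directory.
    --inplace updates destination files in place instead of writing a new
    copy; --no-whole-file keeps rsync's delta-transfer algorithm enabled."""
    return ["rsync", "-a", "--inplace", "--no-whole-file",
            f"{src}/", f"{remote}:{src}/"]

src = snapshot_dir(datetime(2014, 12, 2, 12))
print(src)  # /var/lib/one/snapshots/2014120212/DRSNAP12H
print(" ".join(rsync_command(src, "dr-site.example")))  # hypothetical host
```

Updating files in place touches only the blocks that actually changed on the receiving filesystem, which is presumably what keeps the remote snapshots free of chunk changes.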
![Page 45: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/45.jpg)
![Page 46: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/46.jpg)
virsh# domblkstat instance-0012 --device vda
vda rd_req 128
vda rd_bytes 2344448
vda wr_req 234
vda wr_bytes 618496
vda flush_operations 2
vda rd_total_times 106512819
vda wr_total_times 960359872
vda flush_total_times 1741727
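The slides do not say how these counters are used. One plausible use, assumed here, is to estimate each VM's write rate from two successive samples so that sync queues can be prioritized. The sketch parses output shaped like the sample above; `parse_domblkstat` and `write_rate` are hypothetical helper names.

```python
SAMPLE = """\
vda rd_req 128
vda rd_bytes 2344448
vda wr_req 234
vda wr_bytes 618496
vda flush_operations 2
"""

def parse_domblkstat(output: str) -> dict:
    """Turn 'device counter value' lines into a {counter: int} mapping."""
    stats = {}
    for line in output.splitlines():
        parts = line.split()
        if len(parts) == 3:
            _device, key, value = parts
            stats[key] = int(value)
    return stats

def write_rate(before: dict, after: dict, seconds: float) -> float:
    """Average bytes written per second between two samples."""
    return (after["wr_bytes"] - before["wr_bytes"]) / seconds

first = parse_domblkstat(SAMPLE)
# Hypothetical second sample taken 10 seconds later.
second = dict(first, wr_bytes=first["wr_bytes"] + 1_000_000)
print(write_rate(first, second, 10))  # 100000.0
```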
![Page 47: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/47.jpg)
Our “pilot light” approach: a running OpenNebula on two nodes, with its own LizardFS store, running only two VMs: the Oracle and the Tester.
The Oracle checks whether DR is needed, and may require human confirmation before executing the DR failover. If confirmation is given, it takes the latest valid snapshotted datastore, softlinks it and imports the VMs (through snapshots, so it’s instantaneous).
The Tester makes a snapshot of the current stable snapshot, imports the VMs and runs them in a separate, non-routed vnet, then executes a test to see if everything works (workload dependent), then deletes the intermediate snapshots.
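The Oracle's decision flow could be sketched roughly as follows. The slides only say that it checks whether DR is needed and may ask for human confirmation. The idea of requiring several failed probes before raising the alarm, the `quorum` parameter and all names here are assumptions made for illustration.

```python
from enum import Enum

class Decision(Enum):
    NO_ACTION = "no action"
    AWAIT_CONFIRMATION = "await human confirmation"
    FAILOVER = "fail over"

def oracle_decision(probe_results, human_confirmed: bool, quorum: int = 2) -> Decision:
    """probe_results: one boolean per independent check of the primary site
    (True means the primary answered). Hypothetical policy: act only when
    at least `quorum` probes fail, and never fail over without a human."""
    failures = sum(1 for ok in probe_results if not ok)
    if failures < quorum:
        return Decision.NO_ACTION
    if not human_confirmed:
        return Decision.AWAIT_CONFIRMATION
    return Decision.FAILOVER

print(oracle_decision([True, False, True], human_confirmed=False))   # Decision.NO_ACTION
print(oracle_decision([False, False, True], human_confirmed=False))  # Decision.AWAIT_CONFIRMATION
print(oracle_decision([False, False, False], human_confirmed=True))  # Decision.FAILOVER
```

Keeping the confirmation step explicit means a single failed probe, or a flapping link, cannot trigger a failover on its own.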
![Page 48: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/48.jpg)
Only critical VMs are executed this way, if RTO<30 mins.
For the VMs with higher RTO, buy one week of hardware on demand, auto-install a node with Puppet or Ansible, and make it join the OpenNebula cloud.
Deployed usually in 30 mins. Other vendors guarantee <15 minutes.
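The 30-minute threshold above amounts to a simple per-VM placement rule. A minimal sketch, with hypothetical VM names and RTO values:

```python
PILOT_LIGHT_RTO_MINUTES = 30  # threshold from the slide

def recovery_strategy(rto_minutes: float) -> str:
    """VMs with an RTO under 30 minutes restart on the pilot-light nodes;
    the rest wait for hardware bought on demand and auto-installed."""
    if rto_minutes < PILOT_LIGHT_RTO_MINUTES:
        return "pilot-light"
    return "on-demand-hardware"

# Hypothetical inventory: VM name -> RTO in minutes.
vms = {"erp-db": 15, "mail": 120, "build-server": 1440}
plan = {name: recovery_strategy(rto) for name, rto in vms.items()}
print(plan)
```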
![Page 49: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/49.jpg)
![Page 50: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/50.jpg)
![Page 51: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/51.jpg)
Ideal for harsh indoor environments that require protection from falling dirt or liquid, dust, light splashing, oil or coolant seepage. Its NEMA Zone 4 rating also makes it perfect for facilities located in earthquake-prone seismic zones or any environment prone to extreme vibration such as factories, power stations, construction areas, shipping facilities, warehouses, processing plants, railroads, airports and military installations.
![Page 52: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/52.jpg)
![Page 53: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/53.jpg)
![Page 54: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/54.jpg)
● Have a “big red button” to stop DR if needed. Sometimes you are already fighting fire here, and you know it’s better not to move everything in flight.
● Have two people that are competent as DR firefighters, and give them a second phone with a rechargeable card. And make sure both don’t go on vacation together. (Hint: don’t choose two married people)
![Page 55: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/55.jpg)
● Use a gateway machine to provide a consistent internal IP scheme, and two different configurations for the gateway router to provide unmodified routing for the remaining VMs
● Aggregate functionality in a single VM (for example, one that manages logs) to optimize writes
![Page 56: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/56.jpg)
● I favor consistency, so I tend to avoid application-level replication, unless it’s native to the app (e.g. NoSQL). Otherwise you have different solutions for different machines (e.g. quorum group in MS replication with same UUID…)
● Try to reduce write amplification for databases, especially MySQL, e.g. with TokuDB and its fractal tree
![Page 57: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/57.jpg)
![Page 58: OpenNebula Conf 2014 | OpenNebula and MooseFS for disaster recovery: real clouds in real life - Carlo Daffara](https://reader030.vdocuments.site/reader030/viewer/2022020218/559a6ac31a28abe1348b4829/html5/thumbnails/58.jpg)
Thank you!
Carlo Daffara
@cdaffara
linkedin.com/in/cdaffara