Logical Architecture for Protection
TRANSCRIPT
1
Physical Models and Logical Architecture
Sunita Shrivastava, Vijay Sen
2
Work in Progress, Needs Your Input
• Work in progress
  – Intent is to collect feedback and hear your views
  – Will need your help to drive this to its logical conclusion
  – Several drill-downs required
• Why is this significant?
  – Historically we did data protection/backup in the following backdrop: no virtualization, no cloud, low-end storage technologies
  – New scenarios around data protection are now a lot more feasible due to the availability and maturity of these technologies
  – We need to understand our existing assets
  – Intent is not to redesign the entire code, but to think through this systematically and come up with the following answers:
    a) Are we handling the new scenarios in an optimal way?
    b) What are the common constructs/shared metadata across the scenarios, and hence across the subservices? Are we building silo'd solutions?
    c) Is a layered AND highly scalable AND HA solution feasible? Where and how do we change gears?
    d) If so, what would a roadmap for the transition look like?
3
Proposed Plan
• What's our vision?
  – Exit criteria: comprehensive vision objectives and non-objectives agreed upon by staff
• What assets do we have today?
  – Exit criteria: perceived strengths and weaknesses identified for the assets
• What are the new relevant scenarios?
  – Exit criteria: vNext scenarios signed off
• What are the limitations today?
  – Exit criteria: gaps between scenarios required and assets available identified
• What is the new architecture?
  – Exit criteria: benefits of the new arch well understood; split between platform and management bought off by Windows team and leads
4
High Level Scenarios
• Backup
• HA (geo clustering)
• Disaster Recovery
  – Rehydration
• Application Migration
• Archival
Guiding Principles
• Unified management across data protection, DR and migration
• Tailored to application owners and hosters
• Hybrid cloud awareness
• Enterprise-class offering
• Alignment with Windows and CDM
• Our team innovates in management, but leverages replication technologies
• We are a platform and an end-to-end solution (management is not extensible, but we are a platform for other replication providers)

Non-goals
• Support for non-Microsoft clouds
• TBD
6
Hoster Debate?
• A couple of options:
  – Provide a stack to hosters which allows them to offer a recovery service all within their data center, where they provide storage
    • Another variation would be that Azure Storage is used
  – Provide a stack to hosters which allows them to leverage the "Recovery Service" (running in Azure) for both backup and DR
  – A combination of the above
7
Existing Assets
• Windows Server 8 Backup (Full Server, Critical Volumes (BMR), System State, Individual Volumes, Files/Folders)
  – Strengths
    • Free, simple; sweet spot is 8 to 10 machines; used on departmental/branch office servers in enterprise scenarios; primary ask is BMR
  – Weaknesses
    • Clustering support, centralized console/monitoring
• Client Backup
  – File (zip-file based, for compat?)
  – System Restore (snapshot based, file-level recovery, only backs up settings/registry/system files, affiliated with app/driver installs)
  – System Image Backup
  – History Vault (file-level restore, uses Shadow Copies)
• DPM
  – Strengths
    • Adoption in midmarket
    • SQL is the largest workload?
      – What do we add over the SQL technologies as a value proposition?
    • Exchange the largest workload in Enterprise?
    • What about SharePoint support?
    • Can we make a claim that due to strong recovery models (item level), customers like to use DPM when protecting our apps?
    • Can we make a claim otherwise?
  – Weaknesses
    • Blockers for adoption in large enterprises
      – Tape support
      – Need lower data-loss intervals for mission-critical applications
      – No support for de-duplication - need more analysis
      – Is scale a blocker?
        » Current deployment scale for customers: fan of 10/13 servers, at most 3 to 4 DPM servers
        » We are at the cusp for scale, as demonstrated by 64-node clusters where storage separation is necessary
        » DPM limits: 80 TB for recovery volume and 40 TB for replica volume
8
Existing Assets (2)
• OBS
  – Strengths
    • A service on Azure that is designed for scale and availability
  – Weaknesses
    • No monitoring of whether backup is actually happening
    • Large footprint; worker roles could be shared for higher utilization, fewer COGS and a smaller footprint
    • What have we learnt from the early exposure?
• DPM as a Gateway to Cloud
  – Strengths
    • Great story for off-site protection of data
  – Weaknesses
    • Scale models?
9
Gaps
• Non-optimal data movement
• Storage coupled too tightly with servers, not fungible across servers
• Lack of a single unified protection namespace
  – For Backup, DR, HA
  – For different segments
• Not hoster friendly
• Silo'd services in the Enterprise
  – No coherent SLAs
• No application awareness
• Our resources not as leveraged as they can be
• IMHO,
  – A single protection namespace is the single biggest investment that we need to make from a management perspective
  – Layering over a replication service from a platform perspective
  – Full-fledged support and workflows for recovery
  – A storage service for storage management is an investment we need to support hosters and large enterprises
10
Introducing the Protection Namespace
• A protection namespace hosts protected (or protectable) elements
  – A binding is applied when a protectable element is made protected
  – The binding specifies how a protected element is protected
• Protectable Element
  – <Name, PEType>
• Protected Element
  – <Name, PEType, ProtectionType, Target Recovery Service URI, Destination URI (optional)>
• The user specifies <Disaster Loss Tolerance, RPO, RTO, Destination Preference>, on the basis of which a protection type is chosen
• A protection namespace is rooted at the level of tenancy/subscription
• The leaf nodes are protected elements
• Container nodes serve only an organizational purpose within a namespace
  – Is a flat space good enough?
• What can be imported into the namespace?
  – Can there be policies to automatically discover protectable elements and import them automagically?
• Questions
  – What is the relationship of this namespace to other potential elements or constructs within System Center/Tofino?
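The element and binding tuples above can be expressed as a small data model. The following is an illustrative Python sketch, not the actual service schema; all class and field names are assumptions derived from the tuples on this slide.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class PEType(Enum):
    HOST = "Host"
    VM = "VM"
    VOLUME = "Volume"
    FOLDER = "Folder"
    SQL_DB = "SQL DB"

@dataclass(frozen=True)
class ProtectableElement:
    # <Name, PEType>: something that can be protected but is not yet
    name: str
    pe_type: PEType

@dataclass(frozen=True)
class ProtectedElement:
    # <Name, PEType, ProtectionType, Target Recovery Service URI,
    #  Destination URI (optional)>
    name: str
    pe_type: PEType
    protection_type: str
    recovery_service_uri: str
    destination_uri: Optional[str] = None

@dataclass(frozen=True)
class ProtectionIntent:
    # <Disaster Loss Tolerance, RPO, RTO, Destination Preference>,
    # as specified by the user
    disaster_loss_tolerance_min: int
    rpo_min: int
    rto_min: int
    destination_preference: str

class ProtectionNamespace:
    """Rooted at a tenancy/subscription; leaves are protected elements."""

    def __init__(self, subscription_id: str):
        self.subscription_id = subscription_id
        self._protected = {}

    def protect(self, pe, intent, choose_binding):
        # The "binding": a protection type and target chosen from the
        # user's intent when a protectable element is made protected.
        ptype, target_uri, dest_uri = choose_binding(pe, intent)
        bound = ProtectedElement(pe.name, pe.pe_type, ptype, target_uri, dest_uri)
        self._protected[(pe.name, pe.pe_type)] = bound
        return bound

    def protected_elements(self):
        return list(self._protected.values())
```

The `choose_binding` callback stands in for whatever policy logic maps the user's SLA intent to a protection type.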
11
Understanding Protection Namespaces
• Providing a unified view across the protection namespace will prevent fragmented silo'd solutions which require a lot of bookkeeping
• Protection Namespace (sliced by node/cluster)
  – <Source Type, Source Name (URI), Protection Type, Target Recovery Service URI, Status>
  – Node A
    • <Host, 'A', Windows Server Backup, Disk Z, Green>
    • <Host, 'A', Windows Server Backup, Network Share Z, Green>
  – Node B
    • <Folder, 'Folder xyz', Snapshot Replication, Azure Storage, Green>
  – Node C
    • <Volume, 'Volume a', Snapshot Replication, Enterprise
  – Node D
    • <VM, 'VM abc', Hyper-V Replication-Hot, Target Host Node X, Green>
    • <VM, 'VM cde', Snapshot Replication, Azure Storage, Green>
    • <VM, 'VM fgh', Hyper-V Replication-Cold, Azure Storage, Green>
  – Node E
    • <SQL DB, DB 'bcd', SQL Logging, Green>
  – Cluster E
    • <VM, 'VM efg', Snapshot Replication, Yellow>
    • <VM, 'VM cde', Hyper-V Replication-Hot, Target Host Cluster G>
  – Cluster F
    • <All VMs protected by Snapshot Replication, Status Green>
  – SAN G
    • <SAN Volumes, Volume A to G, SAN Replication, Target SAN>
  – Node H
    • <Volume,
12
Protection Namespaces and Apps
• Protection Namespace (sliced by application): <Protection Type, Schedule, Status>
  – Application XYZ (replace with a real-life example, Hrweb?)
    • Web Tier
      – Node A
      – Node B
      – Node C
        » <Windows Server Backup, Once in 15 days, 3 Recovery Points>
    • Middle Tier
      – Node E
      – Node F
        » <Windows Server Backup, Once in 15 days, 3 Recovery Points>
    • Data Tier
      – Cluster VMs
        » <Hyper-V Replication, 5 minutes, 15 Recovery Points>
• Protection Namespace Chaining
  – <SQL DB A, SQL Synchronous Replication, SQL DB B>
  – <SQL DB B, SQL Asynchronous Replication, SQL DB C>
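The node/cluster slice on the previous slide and the application slice above are just different group-bys over the same flat set of entries. A minimal sketch; the entries and field layout are invented for illustration:

```python
from collections import defaultdict

# One flat record per protected element; the node/cluster view and the
# application view are different group-bys over the same entries.
entries = [
    # (container, application, source_type, source_name, protection_type, status)
    ("Node A",    "XYZ", "Host",   "A",          "Windows Server Backup",   "Green"),
    ("Node B",    "XYZ", "Folder", "Folder xyz", "Snapshot Replication",    "Green"),
    ("Node D",    "XYZ", "VM",     "VM abc",     "Hyper-V Replication-Hot", "Green"),
    ("Cluster E", "XYZ", "VM",     "VM efg",     "Snapshot Replication",    "Yellow"),
]

def slice_by(entries, key_index):
    """Group the flat entries by one field to produce a namespace 'slice'."""
    view = defaultdict(list)
    for entry in entries:
        view[entry[key_index]].append(entry)
    return dict(view)

by_node = slice_by(entries, 0)  # the node/cluster slice
by_app = slice_by(entries, 1)   # the application slice
```

A single backing store with multiple views like this is what avoids the bookkeeping the slide warns about.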
13
Protection Namespace
• Single Server
  – Local protection namespace handled by the Local Recovery Service
  – Should we allow publishing?
    • Optionally, can publish to the Recovery Service or to the Recovery Service (Azure)
• Do we really need a protection namespace?
  – Is it beneficial to the Application Owner/Administrator?
  – Is it beneficial to the Hoster (Fabric/Service Provider)?
  – Is it beneficial to the Fabric Administrator within a Data Center?
14
Application Migration and the Protection Namespace
• What is the relationship of Application Migration to the Protection/Recovery Service?
• Catalog/VSS Writer could aid in application discovery?
  – Unknown
• Commonalities
  – Application Migration equals "IR + Simplified Recovery + Hydration"
  – Failover to the cloud in case of Disaster Recovery is essentially the equivalent of Application Migration in terms of the requirements around the ambience required by the application
    • A migrated application may need a VPN
    • An application failing over to the cloud may need a VPN to be configured
  – In the long term, what do we as a Protection/Recovery team need to do to ensure that the application can be protected appropriately as it is migrated?
15
Three Segments
• Enterprise
• Cloud
  – Azure Services
• Hybrid
  – A unified namespace across the enterprise and cloud
16
Management Tasks for the Protection Namespace
• The success of the entire solution depends not only on plumbing but on the ease with which the data protection needs of a customer can be met
• There are plenty of significant management tasks
  – Management of the protection namespace
    • Protection namespaces
      – Contain protected elements
        » Application, VM, Volumes, Collection of Volumes
      – Hierarchical, nested
        » By Location (Site), By Cluster, By Node
        » By Application
      – Overlapping?
  – Management of protection policies
    • Simplifying policies
      » Stock policies?
      » Based on intent and calibration?
  – Provisioning for replication
    • Driving the underlying replication
  – Monitoring the status of protection
    • Given the policies and the namespaces, alert if things are not on schedule
  – Orchestration for recovery of data
    • Indexing and cataloging for efficient retrieval
    • Orchestration of recovery
  – Management for disaster recovery
    • Reserving fabric
    • Testing of failover and failback to the primary site
    • Orchestration of hydration
  – Management of storage, bandwidth, fabric, and networking for all of the above
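The monitoring task above ("alert if things are not on schedule") amounts to comparing the age of the last recovery point against the policy's interval. A toy sketch, with made-up grace thresholds mapped onto the Green/Yellow/Red status values used elsewhere in this deck:

```python
from datetime import datetime, timedelta

def protection_status(last_recovery_point: datetime,
                      interval: timedelta,
                      now: datetime,
                      grace: timedelta = timedelta(hours=1)) -> str:
    """Flag a protected element whose protection has fallen behind schedule.

    Green  : last recovery point is within the policy interval
    Yellow : overdue, but within a grace window (threshold is invented)
    Red    : well past schedule; alert
    """
    age = now - last_recovery_point
    if age <= interval:
        return "Green"
    if age <= interval + grace:
        return "Yellow"
    return "Red"
```

A monitoring service would run this check per protected element, driven by the policy bound in the namespace.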
17
VMs and Our Scenarios
• Why do VMs require backup?
  – VM corruption?
  – Guest-level backup vs. VM-level backup
    • However, it is important to understand where guest-level replication makes more sense
    • Where does a combination of guest- and VM-level backup make sense?
• VMs lend themselves more easily to migration
  – DR drives virtualization, as DR requires migration
• Definition #1 (for Azure?)
  – Cold Backup
    • Medium to low recovery time, low data loss tolerance
  – Hot Backup
    • Low recovery time + low data loss tolerance
• Definition #2 (applicable to private clouds)
  – Cold Backup
    • Low data loss tolerance; fabric is not reserved; recovery may get long, but resources may be shared more effectively
  – Hot Backup
    • Low recovery time, low data loss tolerance, fabric is reserved
18
Physical Model (Enterprise Only)

[Diagram: Site X and Site Y. Each site contains protected nodes (DAS, NAS, SAN, Hyper-V hosts running VMs and applications), a protected cluster with CSV volumes, protection metadata, protected data (possibly SAN, possibly NAS), and archived data. A private recovery cloud hosts recovery nodes (Hyper-V hosts). Services shown: Protection, Recovery, Hydration, Subscription, DR, Archival and Reporting, Storage, Catalog, Policy/Monitoring, Fabric Mgmt, and a self-service Cloud Service. Data flows shown: VM hot backup, cold backup, server backup, large storage backup, and app migration, with some protected data flowing to cloud storage.]
19
Physical Model (Direct to Cloud)

[Diagram: Site X and Site Y form a private production cloud with protected nodes (DAS, NAS, SAN, Hyper-V hosts running VMs and applications) and a protected cluster with CSV volumes. Protection goes directly to cloud storage (Azure Blobs for data, SQL Azure for metadata), and an IaaS recovery cloud hosts recovery nodes (Hyper-V hosts). Services shown: Protection, Recovery, Hydration, Policy/Monitoring, DR, Archival and Reporting, Storage, Fabric Mgmt, and a self-service Cloud Service. Data flows shown: VM hot backup, cold backup, large storage backup, and VM hydration.]
20
Logical Architecture

[Diagram: On the service side: Recovery Service (web tier, data tier, job service), Protection Service, Retrieval Service, Catalog Service, Disaster Recovery Service, Hydration Service, App Migration Service, Storage Service (protection data), Data Post Processing Roles, offline Recovery Providers, Recovery Service Portal, Migration Service Portal, and Infrastructure Services (Subscription (tenancy), Transport, Jobs, Networking). On the Windows Server side: a Local Recovery Service hosting a Replication Service with pluggable Replication Providers (Snapshot, Hyper-V), VSS Providers, Catalog Providers, Storage Providers (Hyper-V R, Modified VHD Writer), and an Xport Provider (File Write).]
21
Hyper-V Example (Enterprise DR)

[Diagram: A protected Windows Server runs a Local Recovery Service with a Replication Service, Replication Providers (Snapshot, Hyper-V), VSS Providers, Catalog Providers, and Xport Providers (Hyper-V R VM, Modified VHD Writer). It replicates to an enterprise Recovery Service (web tier) and Protection Service (data tier, job service), backed by an Enterprise Storage Service for protection data, with Recovery Service and Migration Service portals. A second Windows Server runs a Local Recovery Service with an Xport Provider (File Write) and a Data Post Processing Role.]
22
Hyper-V R to Cloud Example

[Diagram: A protected Windows Server runs a Local Recovery Service with a Replication Service, Replication Providers (Snapshot, Hyper-V), VSS Providers, Catalog Providers, and Xport Providers (Hyper-V R Cloud, Modified VHD Writer, File Write). It replicates to a Recovery Service (web tier, job service) and Protection Service (data tier), backed by Azure Storage for protection data, with a Hyper-V Data Post Processing Role and Recovery Service/Migration Service portals.]
23
Site Protection
24
Windows Server Backup Example (?)
25
SQL Replication Example
26
Application Migration Sharepoint Example
27
Fine Grained Recovery From Hyper-V
28
• What other examples do we need?
29
Capabilities of Components
• Replication Provider
  – Capability profile
    • Supported protected element types
    • Min data loss tolerance window
    • Max data loss tolerance window
    • Application consistency support
  – Requirement profile
    • Requires off-site post processing - should this be an Xport Provider requirement?
• Recovery Service Profile
  – Capability profile (are these per protected element type?)
    • Recovery time
    • Recovery points in time
    • Retention time
    • Encryption at rest
    • Supported offline recovery providers
• Storage/Xport Provider Profile
  – Capability profile
    • Client-side encryption
    • Which Recovery Service are they affiliated to?
      – Recovery Service in the cloud --- storage is in the cloud
      – Recovery Service in the Enterprise (DPM vNext) – storage is in the Enterprise
      – Recovery Service in another node or cluster – data is stored in storage local to that node/cluster
• Is there a notion of a Recovery Mgmt service that other Recovery Services can use to keep their metadata and catalogs?
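Matching a user's required data loss tolerance window against the registered providers' capability profiles could look like the following sketch. The class and field names are assumptions derived from the profile fields listed above, not an actual API:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReplicationProviderProfile:
    name: str
    supported_pe_types: frozenset        # protected element types it can handle
    min_loss_tolerance_min: int          # tightest data-loss window it honors
    max_loss_tolerance_min: int          # loosest data-loss window it honors
    app_consistent: bool                 # application consistency support

def matching_providers(profiles, pe_type, required_loss_tolerance_min,
                       need_app_consistency):
    """Providers whose capability profile covers the requested protection."""
    return [
        p for p in profiles
        if pe_type in p.supported_pe_types
        and p.min_loss_tolerance_min
            <= required_loss_tolerance_min
            <= p.max_loss_tolerance_min
        and (p.app_consistent or not need_app_consistency)
    ]
```

The numbers below are invented, but they illustrate why a tighter tolerance requirement narrows the candidate list.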
30
Major Components
• Subscription Service: creates a protection namespace for a given customer
• Protection Service: allows creation of protected elements within a protection namespace
• Catalog Service: provides for creation of a catalog for the protected elements in a given protection namespace
• Recovery Service: allows recovery of data for a protected element in a protection namespace
• Hydration Service: uses the Recovery Service to hydrate VMs in a private cloud or to Azure
• Job Service: performs long-running tasks submitted by the main services and provides the infrastructure to monitor their progress
• Data Post Processing Roles: a replication provider can register a data post-processing role to process data before it is stored
31
Components (Client Side)
• Recovery Service (Agent): manages/orchestrates the process of providing protection for a protected element and associates it with a recovery service
• Replication Service: provides the framework/platform for different replication providers to plug in
• Replication Provider
• Xport Provider
• Catalog Provider
• VSS Writers
33
Benefits
• Sets the framework for a unified namespace for Backup/DR/HA
• Creates a hoster-friendly stack
  – Hosters should want to deploy our stack in their datacenter to provide value-added offerings
  – Retains a model where hosters can also easily leverage Azure resources for their recovery scenarios
  – Need to understand what kinds of extensibility they would need beyond building their own portal
  – Over time we get a mostly unified codebase written to the service model
34
Roadmap/Next Steps
• Next steps
  – Build a roadmap, possibly multi-release, to get there
  – Vteams to discuss and iterate over this
35
Plausible Roadmap
• vNext
  – Build the protection mgmt service for the Azure segment (protect on Azure)
    • Align with Tofino
    • Notion of an application definition or service template
      – How do we leverage and align with that?
  – Evolve DPM to be the protection mgmt service for Enterprises/Hosters
    • Adopt the OBS/service architecture that supports multi-tenancy
    • Be the platform of choice for hosters to adopt to provide data protection services to their customers
    • Ensure that it works seamlessly with the OBS service to provide geo-protection using Azure Cloud Storage
• vNext Next
  – Figure out the evolution of components to serve the hybrid cloud or the combined namespace
36
The Data Replication Problem
• Limiting factors
  – Throughput at the sending and the receiving side
  – Storage at the processing side
• Consists of the following parts
  – IR, change tracking and data movement
  – Catalog
• IR
  – Can we avoid/circumvent the problem by the use of published well-known images?
• Change tracking
  – Data must be self-descriptive
• Data movement
  – Channel
    • Must implement push and pull
    • Selection of the endpoint listener
      – Azure Replication Storage Service (cloud backup for VMs)
      – Private Cloud Replication Storage Service
      – Hyper-V Host Replication Listener (hot backup)
    • Negotiate for compression
    • Encryption on the wire
    • Support throttling
• Catalog
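The channel requirements above (self-descriptive change records, compression negotiation, throttling) can be sketched end to end. The in-memory listener and the per-push byte budget are toy stand-ins for illustration, not a proposed design:

```python
import zlib

class InMemoryListener:
    """Toy endpoint listener standing in for, e.g., a cloud replication
    storage service; stores decompressed change payloads keyed by offset."""
    supports_compression = True

    def __init__(self):
        self.blocks = {}

    def receive(self, offset, payload, compressed):
        self.blocks[offset] = zlib.decompress(payload) if compressed else payload

class ReplicationChannel:
    """Pushes self-descriptive change records (offset, bytes) to a listener.

    Compression is negotiated up front (used only if the listener supports
    it), and a crude per-push byte budget stands in for throttling.
    """

    def __init__(self, listener, byte_budget=1 << 20):
        self.listener = listener
        self.byte_budget = byte_budget
        self.compress = getattr(listener, "supports_compression", False)

    def push(self, changes):
        sent = 0
        for offset, data in changes:
            payload = zlib.compress(data) if self.compress else data
            if sent + len(payload) > self.byte_budget:
                break  # budget exhausted; remaining changes retried later
            self.listener.receive(offset, payload, self.compress)
            sent += len(payload)
        return sent
```

A pull mode would invert the loop, with the listener requesting ranges, but the record format and negotiation would stay the same.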
37
A Layered Architecture? Possible?
• Option 1: Extensibility at the source (replication layer solely focused on change tracking)
  – Replication layer responsibilities:
    1. Enable change tracking
    2. Notify handlers for safe transmission and persistence of data
  – Data protection layer responsibilities:
    1. Authentication
    2. Transmission format
    3. Provide the acks required as per the replication protocol
• Option 2: Extensibility at the listener (transmit change data to a specified listener; the entities at the two ends of the channel agree on a protocol)
  – Replication layer responsibilities:
    1. Change tracking
    2. Provide a set of listeners at the destination end of the channel
    3. Authentication for the channel
    4. Formats of transmission
  – Data protection layer responsibilities: two models here:
    a) The data protection layer controls persistence
    b) The data protection layer preps the storage and the replication layer writes directly to the storage
38
Replication Provider Profile
• Min data loss tolerance window
• Max data loss tolerance window
• Application consistency support
• Recovery time
  – This depends more on the state in which the most up-to-date copy is kept
• Recovery points in time
  – To some extent this is not a capability of the provider but a limit imposed by the storage or driven by requirements
• Retention time
  – Not really a capability of the provider
39
Requirements for Coupling Replication Providers and Storage Providers
40
Basic Interaction
• The user tells the management layer the source they need to protect and specifies the SLAs (data loss tolerance, RPO, RTO and retention requirements)
• The management layer queries the Replication Service
  – The Replication Service queries the replication providers which have registered with it
  – It returns the candidate providers
  – The management layer presents the choice to the user
• Management asks the Replication Service to configure replication with the user's choice
• There is an initial handshake with the listener endpoint where queries for storage are negotiated
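The interaction above can be sketched as a toy registry. The provider and listener classes here are made-up stand-ins for the real replication providers and endpoint listeners, and the storage rule is invented:

```python
class ReplicationService:
    """Registry the management layer queries; replication providers
    register themselves and advertise what they can protect."""

    def __init__(self):
        self._providers = []

    def register(self, provider):
        self._providers.append(provider)

    def query(self, source_type, sla):
        # Step 2: find registered providers that claim to meet the SLA
        return [p for p in self._providers if p.can_protect(source_type, sla)]

    def configure(self, provider, source, sla, listener):
        # Steps 4-5: configure replication with the user's chosen provider;
        # the initial handshake with the listener negotiates storage
        storage = listener.negotiate_storage(source, sla)
        return {"provider": provider.name, "source": source, "storage": storage}

class ToyProvider:
    def __init__(self, name, types, min_rpo_min):
        self.name, self.types, self.min_rpo_min = name, types, min_rpo_min

    def can_protect(self, source_type, sla):
        # a provider can honor any RPO no tighter than its minimum
        return source_type in self.types and sla["rpo_min"] >= self.min_rpo_min

class ToyListener:
    def negotiate_storage(self, source, sla):
        # reserve storage for the requested retention (made-up rule)
        return {"volume": f"store/{source}", "retention_days": sla["retention_days"]}
```

The management layer would present the `query` result to the user and pass the chosen provider to `configure`.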
41
Appendix
42
Windows 8 Storage Investments
• Windows Storage Pools: storage virtualization over commodity disks, but providing advanced capabilities
  – Spaces: virtual disks created off of storage pools
• Offloaded Data Transfer
  – The copy is offloaded to the intelligent storage array
• SMB Scaleout
  – SMB Direct: clients need a NIC with RDMA capability
  – SMB Multipath: adds robustness
  – SMB VSS for remote file shares
• CSV
  – Available for application workloads; integrated with storage pools, thin provisioning, SMB scale-out; support for fully featured VSS
• Data De-duplication
  – On server: how does dedup compare to our compression?
  – On host: DPM 2012 can handle deduped data
43
Replication Comparison
• Hyper-V Replication
  – Provides low data loss tolerance and write-order consistency
  – Depends on MSCS clustering
    • Not very resilient to primary host failure (will require resync)
    • Not very resilient to replica failure
    • Buffers will overflow; doesn't have log folding
  – Doesn't separate staging of VMs from data storage
    • The replica server may be receiving data for some VMs while at the same time hosting a VM that has failed over
  – How will it leverage storage deduplication?
• Snapshotting and USN-based File Tracking Mechanisms
  – USN-based file change tracking mechanisms coupled with volume snapshotting help extract the changes between two snapshots
  – A file system filter driver helps track the file blocks that have changed
  – Resyncs are required if tracking is upset
  – More resilient to a DPM server outage
  – Snapshotting on the receiving side is a blocker for scale --- how many concurrent VSS snapshots can a server perform across different volumes?
    • Chained snapshotting helps utilize epoch-based recovery
    • Each snapshot represents an epoch
• Data Loss Tolerance
  – For Hyper-V, SCSI writes are copied into a buffered log pretty much continuously
  – For DPM, copy-on-write is enabled during the interval that buffered copies happen
• So,
  – How low can we squeeze the data loss tolerance with DPM?
  – How high can we squeeze the data loss window with Hyper-V R?
  – We need instrumentation data; ideally we should be able to compare the same workload
  – We can calibrate the workload and intent and choose... but then
    • What happens when the workload changes?
44
Catalog
• Catalog – historical; tells what the high-level contents of a backup are
  – This essentially provides for browsability before a full recovery is undertaken
    • The metadata for the structure/high-level contents of an application is part of the data associated with a certain recovery point; however, the catalog can help you identify which recovery point may have the data of interest
  – Can the catalog information be handed down via the VSS snapshot process?
    • We expect the catalog to be tree-structured
    • This can be huge for a large application
    • In such cases, can the applications be responsible for keeping an up-to-date catalog?
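The browsability idea above, locating which recovery point holds an item of interest by walking each recovery point's tree-structured catalog before undertaking a full recovery, can be sketched as follows (the catalog contents are invented):

```python
# Each recovery point carries a tree-structured catalog of the high-level
# contents of that backup: nested dicts for containers, None for leaves.
catalogs = {
    "rp-2012-01-01": {"FolderA": {"report.doc": None}, "FolderB": {}},
    "rp-2012-01-02": {"FolderA": {}, "FolderB": {"budget.xls": None}},
}

def contains(tree, name):
    """Depth-first search of one recovery point's catalog tree."""
    for key, child in tree.items():
        if key == name:
            return True
        if isinstance(child, dict) and contains(child, name):
            return True
    return False

def recovery_points_with(catalogs, name):
    """Recovery points whose catalog mentions the item of interest."""
    return [rp for rp, tree in catalogs.items() if contains(tree, name)]
```

This is why catalog size matters: the search touches every tree, which is the scaling concern the slide raises for large applications.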
45
Replicated Content Format
• DPM stores the content uncompressed/unencrypted and uses VSS snapshots as a mechanism to create point-in-time copies
• Hyper-V R supports VHD 2.0; data is not encrypted at rest but may be encrypted for transmission; data is not compressed
• OBS supports a modified VHD 1.0 (metadata is VHD 1.0; blocks are compressed and encrypted at rest)
• We are running tests on how much extraction, decryption and decompression add to the recovery time