
Page 1: Logical Architecture for Protection

Physical Models and Logical Architecture

Sunita Shrivastava, Vijay Sen

Page 2: Logical Architecture for Protection

Work in Progress, Needs Your Input

• Work in progress
  – Intent is to collect feedback and hear your views
  – Will need your help to drive this to its logical conclusion
  – Several drill-downs required

• Why is this significant?
  – Historically we did data protection/backup against the following backdrop: no virtualization, no cloud, low-end storage technologies
  – New scenarios around data protection are now far more feasible due to the availability and maturity of these technologies
  – We need to understand our existing assets
  – Intent is not to redesign the entire code but to think through this systematically and come up with answers to the following:
    a) Are we handling the new scenarios in an optimal way?
    b) What are the common constructs/shared metadata across the scenarios, and hence across the subservices? Are we building silo'd solutions?
    c) Is a layered AND highly scalable AND highly available solution feasible? Where and how do we change gears?
    d) If so, what would a roadmap for the transition to this look like?

Page 3: Logical Architecture for Protection

Proposed Plan

• What's our vision?
  – Exit criteria: comprehensive vision objectives and non-objectives agreed upon by staff
• What assets do we have today?
  – Exit criteria: perceived strengths and weaknesses identified for the assets
• What are the new relevant scenarios?
  – Exit criteria: vNext scenarios signed off
• What are the limitations today?
  – Exit criteria: gaps between scenarios required and assets available identified
• What is the new architecture?
  – Exit criteria: benefits of the new architecture well understood; split between platform and management bought off by Windows team and leads

Page 4: Logical Architecture for Protection

High Level Scenarios

• Backup
• HA (geo-clustering)
• Disaster Recovery
  – Rehydration
• Application Migration
• Archival

Page 5: Logical Architecture for Protection

Principles

Guiding principles
• Unified management across data protection, DR and migration
• Tailored to application owners and hosters
• Hybrid cloud awareness
• Enterprise-class offering
• Alignment with Windows and CDM
• Our team innovates in management, but leverages replication technologies
• We are a platform and an end-to-end solution (management is not extensible, but we are a platform for other replication providers)

Non-goals
• Support for non-Microsoft clouds
• TBD

Page 6: Logical Architecture for Protection

Hoster Debate

• A couple of options:
  – Provide a stack to hosters that allows them to offer a recovery service entirely within their data center, where they provide the storage
    • Another variation would be that Azure Storage is used
  – Provide a stack to hosters that allows them to leverage the "Recovery Service" (running in Azure) for both backup and DR
  – A combination of the above

Page 7: Logical Architecture for Protection

Existing Assets

• Windows Server 8 Backup (Full Server, Critical Volumes (BMR), System State, Individual Volumes, Files/Folders)
  – Strengths
    • Free, simple; sweet spot is 8 to 10 machines; used on departmental/branch-office servers in enterprise scenarios; primary ask is BMR
  – Weaknesses
    • Clustering support, centralized console/monitoring
• Client Backup
  – File (zip-file based, for compat?)
  – System Restore (snapshot based, file-level recovery, only backs up settings/registry/system files, affiliated with app/driver installs)
  – System Image Backup
  – History Vault (file-level restore, uses Shadow Copies)
• DPM
  – Strengths
    • Adoption in midmarket
    • SQL is the largest workload?
      – What do we add over the SQL technologies as a value proposition?
    • Exchange the largest workload in Enterprise?
    • What about SharePoint support?
    • Can we claim that due to strong recovery models (item level), customers like to use DPM when protecting our apps? Can we make a claim otherwise?
  – Weaknesses
    • Blockers for adoption in large Enterprises:
      – Tape support
      – Need lower data-loss intervals for mission-critical applications
      – No support for de-duplication (needs more analysis)
      – Is scale a blocker?
        » Current deployment scale for customers: fan-out of 10/13 servers, at most 3 to 4 DPM servers
        » We are at the cusp for scale, as demonstrated by 64-node clusters where storage separation is necessary
        » DPM limit: 80 TB for recovery volume and 40 TB for replica volume

Page 8: Logical Architecture for Protection

Existing Assets (2)

• OBS
  – Strengths
    • A service on Azure designed for scale and availability
  – Weaknesses
    • No monitoring of whether backup is actually happening
    • Large footprint; worker roles could be shared for higher utilization, fewer COGS and a smaller footprint
    • What have we learnt from the early exposure?
• DPM as a Gateway to Cloud
  – Strengths
    • Great story for off-site protection of data
  – Weaknesses
    • Scale models?

Page 9: Logical Architecture for Protection

Gaps

• Non-optimal data movement
• Storage coupled too tightly with servers, not fungible across servers
• Lack of a single unified protection namespace
  – For Backup, DR, HA
  – For different segments
• Not hoster friendly
• Silo'd services in the Enterprise
  – No coherent SLAs
• No application awareness
• Our resources not as leveraged as they can be
• IMHO,
  – A single protection namespace is the single biggest investment we need to make from a management perspective
  – Layering over a replication service from a platform perspective
  – Full-fledged support and workflows for recovery
  – A storage service for storage management is an investment we need in order to support hosters and large enterprises

Page 10: Logical Architecture for Protection

Introducing the Protection Namespace

• A protection namespace hosts protected (or protectable) elements
  – A binding is applied when a protectable element is made protected
  – The binding specifies how a protected element is protected
• Protectable Element
  – <Name, PEType>
• Protected Element
  – <Name, PEType, ProtectionType, Target Recovery Service URI, Destination URI (optional)>
• The user specifies <Disaster Loss Tolerance, RPO, RTO, Destination Preference>, on the basis of which a protection type is chosen
• A protection namespace is rooted at the level of tenancy/subscription
• The leaf nodes are protected elements
• Container nodes serve only an organizational purpose within a namespace
  – Is a flat space good enough?
• What can be imported into the namespace?
  – Can there be policies to automatically discover protectable elements and import them automagically?
• Questions
  – What is the relationship of this namespace to other potential elements or constructs within System Center/Tofino?
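The element and binding tuples above can be sketched as data types. This is a minimal illustration, not a committed design; the type names, folding the protected-element tuple into an element plus a Binding, and the thresholds in `choose_protection_type` are all assumptions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass(frozen=True)
class ProtectableElement:
    # <Name, PEType> from the slide
    name: str
    pe_type: str  # e.g. "VM", "Volume", "SQL DB"

@dataclass(frozen=True)
class Binding:
    # applied when a protectable element is made protected;
    # specifies how the protected element is protected
    protection_type: str
    recovery_service_uri: str
    destination_uri: Optional[str] = None

@dataclass(frozen=True)
class ProtectedElement:
    element: ProtectableElement
    binding: Binding

def choose_protection_type(loss_tolerance_s: int, rpo_s: int, rto_s: int) -> str:
    """Toy policy: map the user's intent tuple to a protection type."""
    if loss_tolerance_s <= 300 and rto_s <= 300:
        return "Hyper-V Replication-Hot"
    if loss_tolerance_s <= 3600:
        return "Snapshot Replication"
    return "Windows Server Backup"

@dataclass
class ProtectionNamespace:
    # rooted at the tenancy/subscription level; leaves are protected elements
    subscription: str
    protected: list = field(default_factory=list)

    def protect(self, pe: ProtectableElement, binding: Binding) -> ProtectedElement:
        el = ProtectedElement(pe, binding)
        self.protected.append(el)
        return el
```

A flat `protected` list reflects the open "is a flat space good enough?" question; container nodes would only add organization, not semantics.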

Page 11: Logical Architecture for Protection

Understanding Protection Namespaces

• Providing a unified view across the protection namespace will prevent fragmented, silo'd solutions which require a lot of bookkeeping
• Protection Namespace (sliced by nodes/clusters)
  – <Source Type, Source Name (URI), Protection Type, Target Recovery Service URI, Status>
  – Node A
    • <Host, 'A', Windows Server Backup, Disk Z, Green>
    • <Host, 'A', Windows Server Backup, Network Share Z, Green>
  – Node B
    • <Folder, 'Folder xyz', Snapshot Replication, Azure Storage, Green>
  – Node C
    • <Volume, 'Volume a', Snapshot Replication, Enterprise>
  – Node D
    • <VM, 'VM abc', Hyper-V Replication-Hot, Target Host Node X, Green>
    • <VM, 'VM cde', Snapshot Replication, Azure Storage, Green>
    • <VM, 'VM fgh', Hyper-V Replication-Cold, Azure Storage, Green>
  – Node E
    • <SQL DB, DB 'bcd', SQL Logging, Green>
  – Cluster E
    • <VM, 'VM efg', Snapshot Replication, Yellow>
    • <VM, 'VM cde', Hyper-V Replication-Hot, Target Host Cluster G>
  – Cluster F
    • <All VMs protected by Snapshot Replication, Status Green>
  – SAN G
    • <SAN Volumes, Volume A to G, SAN Replication, Target SAN>
  – Node H
    • <Volume,

Page 12: Logical Architecture for Protection

Protection Namespaces and Apps

• Protection Namespace (sliced by application) <Protection Type, Schedule, Status>
  – Application XYZ (replace with a real-life example, HRWeb?)
    • Web Tier
      – Node A
      – Node B
      – Node C
        » <Windows Server Backup, Once in 15 days, 3 Recovery Points>
    • Middle Tier
      – Node E
      – Node F
        » <Windows Server Backup, Once in 15 days, 3 Recovery Points>
    • Data Tier
      – Cluster VMs
        » <Hyper-V Replication, 5 minutes, 15 Recovery Points>
• Protection Namespace Chaining
  – <SQL DB A, SQL Synchronous Replication, SQL DB B>
  – <SQL DB B, SQL Asynchronous Replication, SQL DB C>

Page 13: Logical Architecture for Protection

Protection Namespace

• Single Server
  – Local protection namespace handled by the Local Recovery Service
  – Should we allow publishing?
    • Optionally, can publish to the Recovery Service or to the Recovery Service (Azure)
• Do we really need a protection namespace?
  – Is it beneficial to the Application Owner/Administrator?
  – Is it beneficial to the Hoster (Fabric/Service Provider)?
  – Is it beneficial to the Fabric Administrator within a Data Center?

Page 14: Logical Architecture for Protection

Application Migration and the Protection Namespace

• What is the relationship of Application Migration to the Protection/Recovery Service?
• Catalog/VSS writer could aid in application discovery?
  – Unknown
• Commonalities
  – Application Migration equals "IR + Simplified Recovery + Hydration"
  – Failover to the cloud in case of disaster recovery is essentially equivalent to application migration in terms of the requirements around the ambience required by the application
    • A migrated application may need a VPN
    • An application failing over to the cloud may need a VPN to be configured
  – In the long term, what do we as a Protection/Recovery team need to do to ensure that the application can be protected appropriately as it is migrated?

Page 15: Logical Architecture for Protection

Three Segments

• Enterprise
• Cloud
  – Azure Services
• Hybrid
  – A unified namespace across the enterprise and cloud

Page 16: Logical Architecture for Protection

Management Tasks for the Protection Namespace

• The success of the entire solution depends not only on plumbing but on the ease with which a customer's data protection needs can be met
• There are plenty of significant management tasks:
  – Management of the protection namespace
    • Protection namespaces
      – Contain protected elements
        » Applications, VMs, volumes, collections of volumes
      – Hierarchical, nested
        » By location (site), by cluster, by node
        » By application
      – Overlapping?
  – Management of protection policies
    • Simplifying policies
      » Stock policies?
      » Based on intent and calibration?
  – Provisioning for replication
    • Driving the underlying replication
  – Monitoring the status of protection
    • Given the policies and the namespaces, alert if things are not on schedule
  – Orchestration for recovery of data
    • Indexing and cataloging for efficient retrieval
    • Orchestration of recovery
  – Management for disaster recovery
    • Reserving fabric
    • Testing of failover and failback to the primary site
    • Orchestration of hydration
  – Management of storage, bandwidth, fabric and networking for all of the above
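The "alert if things are not on schedule" task could look roughly like the following; the grace period, field names and namespace shape are assumptions for illustration:

```python
# Illustrative schedule check: given a policy interval and the time of the
# last recovery point, flag protected elements whose protection has fallen
# behind. All names and the 30-minute grace period are assumptions.
from datetime import datetime, timedelta

def behind_schedule(last_recovery_point: datetime,
                    interval: timedelta,
                    now: datetime,
                    grace: timedelta = timedelta(minutes=30)) -> bool:
    """True if the next expected recovery point is overdue beyond a grace period."""
    return now > last_recovery_point + interval + grace

def scan(namespace, now):
    """Yield (element name, overdue_by) for every protected element that is late."""
    for el in namespace:
        due = el["last_rp"] + el["interval"] + timedelta(minutes=30)
        if now > due:
            yield el["name"], now - due
```

A real monitoring service would drive this from the policies and namespaces described above rather than from an in-memory list, but the check itself is the same comparison.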

Page 17: Logical Architecture for Protection

VMs and Our Scenarios

• Why do VMs require backup?
  – VM corruption?
  – Guest-level backup vs. VM-level backup
    • However, it is important to understand where guest-level replication makes more sense
    • Where does a combination of guest- and VM-level backup make sense?
• VMs lend themselves more easily to migration
  – DR drives virtualization, as DR requires migration
• Definition #1 (for Azure?)
  – Cold Backup
    • Medium to low recovery time, low data loss tolerance
  – Hot Backup
    • Low recovery time + low data loss tolerance
• Definition #2 (applicable to private clouds)
  – Cold Backup
    • Low data loss tolerance; fabric is not reserved; recovery may get long, but resources may be shared more effectively
  – Hot Backup
    • Low recovery time, low data loss tolerance, fabric is reserved

Page 18: Logical Architecture for Protection

Physical Model (Enterprise Only)

[Diagram: protected nodes, a protected cluster and Hyper-V hosts at Site X and Site Y, with DAS, SAN/NAS protected data and CSV volumes, replicate into a private recovery cloud containing recovery nodes (Hyper-V hosts). Services shown: Protection Service, Recovery Service, Hydration Service, Subscription Service, Catalog Service, Storage Service, DR Service, Archival and Reporting Service, Policy/Monitoring Service, Fabric Mgmt Service, and a self-service Cloud Service. Flows shown: VM hot backup and cold backup, server backup, large storage backup, app migration, and archived data in cloud storage; protection metadata is kept alongside the protected data.]

Page 19: Logical Architecture for Protection

Physical Model (Direct to Cloud)

[Diagram: protected nodes and a protected cluster at Site X and Site Y, in a private production cloud with DAS, SAN/NAS and CSV volumes, replicate directly to cloud storage: Azure Blobs for the data and SQL Azure for the metadata. A recovery cloud (IaaS) hosts recovery nodes (Hyper-V hosts) onto which VMs are hydrated. Services shown: Protection Service, Recovery Service, Hydration Service, Policy/Monitoring Service, DR Service, Storage Service, Fabric Mgmt Service, Archival and Reporting Service, and a self-service Cloud Service. Flows shown: VM hot backup, cold backup, large storage backup, and VM hydration.]

Page 20: Logical Architecture for Protection

Logical Architecture

[Diagram: on Windows Server, a Local Recovery Service hosts a Replication Service with pluggable Replication Providers (Hyper-V R, Snapshot), VSS Providers, Storage Providers (Hyper-V R, Modified VHD Writer), an Xport Provider (File Write), and Catalog Providers. The service side comprises the Recovery Service (web tier, data tier, job service), Protection Service, Retrieval Service, Catalog Service, Disaster Recovery Service, App Migration Service, Hydration Service, Recovery Providers (including offline), a Storage Service for protection data, Data Post Processing Roles, Infrastructure Services (subscription/tenancy, transport, jobs, networking), and the Recovery Service and Migration Service portals.]

Page 21: Logical Architecture for Protection

Hyper-V Example (Enterprise DR)

[Diagram: two Windows Server machines each run a Local Recovery Service with VSS Providers, a Replication Service with Replication Providers (Hyper-V R, Snapshot), Xport Providers (Hyper-V R VM, Modified VHD Writer, File Write), and Catalog Providers. The Protection Service (job service and data tier) and the Recovery Service (web tier) coordinate replication into an Enterprise Storage Service holding the protection data, with a Data Post Processing Role and the Recovery Service and Migration Service portals.]

Page 22: Logical Architecture for Protection

Hyper-V R to Cloud Example

[Diagram: the same client-side stack on Windows Server (Local Recovery Service, VSS Providers, Replication Service with Hyper-V R and Snapshot Replication Providers, Xport Providers for Hyper-V R Cloud, Modified VHD Writer and File Write, Catalog Providers) replicates into Azure Storage holding the protection data. The Recovery Service (web tier, job service), Protection Service (data tier), a Hyper-V Data Post Processing Role, and the Recovery Service and Migration Service portals run in the cloud.]

Page 23: Logical Architecture for Protection


Site Protection

Page 24: Logical Architecture for Protection


Windows Server Backup Example (?)

Page 25: Logical Architecture for Protection


SQL Replication Example

Page 26: Logical Architecture for Protection


Application Migration Sharepoint Example

Page 27: Logical Architecture for Protection


Fine Grained Recovery From Hyper-V

Page 28: Logical Architecture for Protection

• What other examples do we need?

Page 29: Logical Architecture for Protection

Capabilities of Components

• Replication Provider
  – Capability profile
    • Supported protected element types
    • Min data loss tolerance window
    • Max data loss tolerance window
    • Application consistency support
  – Requirement profile
    • Requires off-site post processing (should this be an Xport Provider requirement?)
• Recovery Service Profile
  – Capability profile (are these per protected element type?)
    • Recovery time
    • Recovery points in time
    • Retention time
    • Encryption at rest
    • Supported offline recovery providers
• Storage/Xport Provider Profile
  – Capability profile
    • Client-side encryption
    • Which Recovery Service are they affiliated to?
      – Recovery Service in the cloud: storage is in the cloud
      – Recovery Service in the Enterprise (DPM vNext): storage is in the Enterprise
      – Recovery Service in another node or cluster: data is stored in storage local to that node/cluster
• Is there a notion of a Recovery Mgmt service that other Recovery Services can use to keep their metadata and cataloging?

Page 30: Logical Architecture for Protection

Major Components

• Subscription Service: creates a protection namespace for a given customer
• Protection Service: allows creation of protected elements within a protection namespace
• Catalog Service: provides for creation of a catalog for the protected elements of a given protection namespace
• Recovery Service: allows recovery of data for a protected element in a protection namespace
• Hydration Service: uses the Recovery Service to hydrate VMs in a private cloud or in Azure
• Job Service: performs long-running tasks submitted by the main services and provides the infrastructure to monitor their progress
• Data Post Processing Roles: a replication provider can register a data post processing role to process data before it is stored

Page 31: Logical Architecture for Protection


Page 32: Logical Architecture for Protection

Components (Client Side)

• Recovery Service (Agent): manages/orchestrates the processes that provide protection for a protected element and associates it with a recovery service
• Replication Service: provides the framework/platform for different replication providers to plug in
• Replication Provider
• Xport Provider
• Catalog Provider
• VSS Writers

Page 33: Logical Architecture for Protection

Benefits

• Sets the framework for a unified namespace for Backup/DR/HA
• Creates a hoster-friendly stack
  – Hosters should want to deploy our stack in their datacenter to provide value-added offerings
  – Retains a model where hosters can also easily leverage Azure resources for their recovery scenarios
  – Need to understand what kinds of extensibility they would need beyond building their own portal
  – Over time we get a mostly unified codebase written to the service model

Page 34: Logical Architecture for Protection

Roadmap/Next Steps

• Next steps
  – Build a roadmap, possibly multi-release, to get there
  – Vteams to discuss and iterate over this

Page 35: Logical Architecture for Protection

Plausible Roadmap

• vNext
  – Build the protection mgmt service for the Azure segment (protect on Azure)
    • Align with Tofino
    • Notion of application definition or service template
      – How do we leverage and align with that?
  – Evolve DPM to be the protection mgmt service for Enterprises/Hosters
    • Adopt the OBS/service architecture that supports multi-tenancy
    • Be the platform of choice for hosters to adopt to provide data protection services to their customers
    • Ensure that it works seamlessly with the OBS service to provide geo-protection using Azure Cloud Storage
• vNext Next
  – Figure out the evolution of components to serve the hybrid cloud or the combined namespace

Page 36: Logical Architecture for Protection

The Data Replication Problem

• Limiting factors
  – Throughput at the sending and the receiving side
  – Storage at the processing side
• Consists of the following parts
  – IR, change tracking and data movement
  – Catalog
• IR
  – Can we avoid/circumvent the problem by the use of published well-known images?
• Change Tracking
  – Data must be self-descriptive
• Data Movement
  – Channel
    • Must implement push and pull
    • Selection of endpoint listener
      – Azure Replication Storage Service (cloud backup for VMs)
      – Private Cloud Replication Storage Service
      – Hyper-V host replication listener (hot backup)
    • Negotiate for compression
    • Encryption on the wire
    • Support throttling
• Catalog
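The change-tracking plus data-movement split above can be illustrated with a toy block-hash diff; the block size, hashing scheme and helper names are assumptions for the sketch, not the actual tracking mechanism (which the deck leaves open between USN tracking, filter drivers, and so on):

```python
# Toy sketch: track which fixed-size blocks changed between two point-in-time
# copies, then ship only those blocks over the channel (push direction).
import hashlib

BLOCK = 4096  # assumed block size for illustration

def block_hashes(data: bytes):
    """Hash fixed-size blocks so changes between two copies are self-descriptive."""
    return [hashlib.sha256(data[i:i + BLOCK]).digest()
            for i in range(0, len(data), BLOCK)]

def changed_blocks(old: bytes, new: bytes):
    """Indices of blocks that differ between the previous and current copy."""
    old_h, new_h = block_hashes(old), block_hashes(new)
    changed = [i for i in range(min(len(old_h), len(new_h))) if old_h[i] != new_h[i]]
    changed += list(range(len(old_h), len(new_h)))  # newly appended blocks
    return changed

def push(old: bytes, new: bytes):
    """Data movement: emit (offset, bytes) pairs for only the changed blocks."""
    return [(i * BLOCK, new[i * BLOCK:(i + 1) * BLOCK])
            for i in changed_blocks(old, new)]
```

Compression, encryption on the wire and throttling would wrap the output of `push` at the channel layer; the point here is only that the limiting factors (throughput and processing-side storage) are proportional to the changed set, not the full copy.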

Page 37: Logical Architecture for Protection

A Layered Architecture? Possible?

Option 1: Extensibility at the source
  – Description: replication layer solely focused on change tracking
  – Responsibilities (Replication Layer):
    1. Enable change tracking
    2. Notification of handlers for safe transmission and persistence of data
  – Responsibilities (Data Protection Layer):
    1. Authentication
    2. Transmission format
    3. Provide the acks required as per the replication protocol

Option 2: Extensibility at the listener
  – Description: transmit change data to a specified listener; the entities at the two ends of the channel agree on a protocol
  – Responsibilities (Replication Layer):
    1. Change tracking
    2. Provide a set of listeners at the destination end of the channel
    3. Authentication for the channel
    4. Formats of transmission
  – Responsibilities (Data Protection Layer): two models here:
    a) Data Protection Layer controls persistence
    b) Data Protection Layer preps the storage and the replication layer writes directly to the storage
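A hypothetical sketch of the "extensibility at the listener" option with model (a), where the data protection layer controls persistence; the class names, the token-based auth stand-in and the ack scheme are all placeholders, not a proposed protocol:

```python
# Sketch of the two-layer split: the replication layer tracks changes and
# transmits them to a listener owned by the data protection layer, which
# persists the data and acks each transmission.
class Listener:
    """Data protection layer: destination end of the channel."""
    def __init__(self):
        self.store = {}

    def authenticate(self, token):
        # stand-in for real channel authentication
        return token == "shared-secret"

    def receive(self, offset, data):
        self.store[offset] = data  # model (a): DP layer controls persistence
        return "ack"               # ack as required by the replication protocol

class ReplicationLayer:
    """Tracks changes and transmits them to the agreed listener."""
    def __init__(self, listener, token):
        if not listener.authenticate(token):
            raise PermissionError("channel authentication failed")
        self.listener = listener

    def transmit(self, changes):
        # changes: iterable of (offset, bytes); require an ack for each
        return all(self.listener.receive(o, d) == "ack" for o, d in changes)
```

Model (b) would differ only in `receive`: the data protection layer would hand the replication layer a prepared storage target and let it write directly.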

Page 38: Logical Architecture for Protection

Replication Provider Profile

• Min data loss tolerance window
• Max data loss tolerance window
• Application consistency support
• Recovery time
  – This depends more on the state in which the most up-to-date copy is kept
• Recovery points in time
  – To some extent this is not a capability of the provider but a limit imposed by the storage or driven by requirements
• Retention time
  – Not really a capability of the provider

Page 39: Logical Architecture for Protection


Requirements for Coupling Replication Providers and Storage Providers

Page 40: Logical Architecture for Protection

Basic Interaction

• The user tells the mgmt layer the source to protect and specifies the SLAs (data loss tolerance, RPO, RTO and retention requirements)
• The mgmt layer queries the Replication Service
  – The Replication Service queries the replication providers which have registered with it
  – It returns the matching providers
  – The mgmt layer presents the choice to the user
• Mgmt asks the Replication Service to configure replication with the user's choice
• There is an initial handshake with the listener endpoint where queries for storage are negotiated
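The interaction above (providers register capability profiles; the management layer queries for matches against the user's SLAs) might be sketched as follows. The profile fields echo the Replication Provider Profile slide, while the matching rule and the sample numbers are assumptions:

```python
# Illustrative provider matching for the basic interaction: providers register
# capability profiles; the replication service returns those whose supported
# data loss window covers the SLA the user asked for.
from dataclasses import dataclass

@dataclass
class ProviderProfile:
    name: str
    min_loss_window_s: int   # Min Data Loss Tolerance Window
    max_loss_window_s: int   # Max Data Loss Tolerance Window
    app_consistent: bool     # Application Consistency Support

class ReplicationService:
    def __init__(self):
        self.providers = []

    def register(self, profile: ProviderProfile):
        self.providers.append(profile)

    def query(self, loss_tolerance_s: int, need_app_consistency: bool):
        """Providers whose supported loss window covers the requested SLA."""
        return [p for p in self.providers
                if p.min_loss_window_s <= loss_tolerance_s <= p.max_loss_window_s
                and (p.app_consistent or not need_app_consistency)]
```

The management layer would present `query(...)` results to the user, then ask the service to configure replication with the chosen provider; the storage handshake with the listener endpoint would follow.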

Page 41: Logical Architecture for Protection


Appendix

Page 42: Logical Architecture for Protection

Windows 8 Storage Investments

• Windows Storage Pools: storage virtualization over commodity disks, but providing advanced capabilities
  – Spaces: virtual disks created off of storage pools
• Offloaded Data Transfer
  – Copy is offloaded to the intelligent storage array
• SMB Scaleout
  – SMB Direct: clients need a NIC with RDMA capability
  – SMB Multipath: adds robustness
  – SMB VSS for remote file shares
• CSV
  – Available for application workloads; integrated with storage pools, thin provisioning, SMB scale-out; support for fully featured VSS
• Data De-duplication
  – On server: how does dedup compare to our compression?
  – On host: DPM 2012 can handle deduped

Page 43: Logical Architecture for Protection

Replication Comparison

• Hyper-V Replication
  – Provides low data loss tolerance and write-order consistency
  – Depends on MSCS clustering
    • Not very resilient to primary host failure (will require resync)
    • Not very resilient to replica failure
    • Buffers will overflow; doesn't have log folding
  – Doesn't separate staging of VMs from data storage
    • The replica server may be receiving data for some VMs while at the same time hosting a VM that has failed over
  – How will it leverage storage deduplication?
• Snapshotting and USN-based file tracking mechanisms
  – USN-based file change tracking coupled with volume snapshotting helps extract the changes between two snapshots
  – A file system filter driver helps track the file blocks that have changed
  – Resyncs are required if tracking is upset
  – More resilient to DPM server outage
  – Snapshotting on the receiving side is a blocker for scale: how many concurrent VSS snapshots can a server perform across different volumes?
  – Chained snapshotting helps utilize epoch-based recovery, each snapshot representing an epoch
• Data loss tolerance
  – For Hyper-V, SCSI writes are copied into a buffered log pretty much continuously
  – For DPM, copy-on-write is enabled during the interval in which buffered copies happen
• So,
  – How low can we squeeze the data loss tolerance with DPM?
  – How high can we squeeze the data loss window with Hyper-V R?
  – We need instrumentation of data; ideally we should be able to compare the same workload
  – We can calibrate the workload and intent and choose… but then what happens when the workload changes?

Page 44: Logical Architecture for Protection

Catalog

• Catalog: historical; tells what the high-level contents of a backup are
  – This essentially provides for browsability before full recovery is undertaken
    • The metadata for the structure/high-level contents of an application is part of the data associated with a certain recovery point; however, the catalog can help identify which recovery point may have the data of interest
  – Can the catalog information be handed down via the VSS snapshot process?
    • We expect the catalog to be tree-structured
    • This can be huge for a large application
    • In such cases, can the applications be responsible for keeping an up-to-date catalog?
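The browsability idea above can be sketched as a per-recovery-point catalog that is searched before any full recovery is attempted; the flat dictionary structure and sample paths are purely illustrative (the expected catalog is tree-structured, which a real implementation would need for large applications):

```python
# Minimal sketch: a catalog of high-level item paths per recovery point,
# searched to find which recovery points may hold an item of interest.
def find_recovery_points(catalog, item_path):
    """catalog: {recovery_point_id: set of item paths}. Newest-first hits."""
    hits = [rp for rp, items in catalog.items() if item_path in items]
    return sorted(hits, reverse=True)

# Hypothetical sample data: lexicographically sortable recovery point ids.
catalog = {
    "rp-2012-01-01": {"/db/sales", "/db/hr"},
    "rp-2012-01-08": {"/db/sales"},
    "rp-2012-01-15": {"/db/sales", "/db/hr"},
}
```

The full structure of each recovery point still lives with the data itself; the catalog only narrows down which point to recover, which is exactly the browsability-before-recovery role described above.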

Page 45: Logical Architecture for Protection

Replicated Content Format

• DPM stores the content uncompressed/unencrypted and uses VSS snapshots as the mechanism to create point-in-time copies

• Hyper-V R supports VHD 2.0; data is not encrypted at rest but may be encrypted for transmission; data is not compressed

• OBS supports a modified VHD 1.0 (metadata is VHD 1.0; blocks are compressed and encrypted at rest)

• We are doing some tests on how much extraction, encryption and decompression add to the recovery time