#ITDEVCONNECTIONS | ITDEVCONNECTIONS.COM

DESIGNING FOR FAILURE
Disaster Recovery using AWS

Karan Desai
Solutions Architect, AWS

speaker:~ $ whoami

> Solutions Architect at AWS since 2016

> Previously Akamai

> Previously Ericsson

> MS EE Virginia Tech

> San Francisco Bay Area resident

> Likes the cloud, airplanes, photography


What to Expect from the Session

• Disaster Recovery Concepts & Terminology
• Why AWS for Disaster Recovery?
• DR Design Options
• Data Backup and Restore Strategies
• DR Testing & Assurance
• One More Thing…

A long time ago in a galaxy far, far away….

1986-04-26 01:23:04
Begin experiment...

Recovery point

Panic
Systems are not normal

Manually interpret signals and intervene

Disaster
Recovery point

Data loss

“There must be an incredible amount of radiation here. We'll be lucky if we're all still alive in the morning.”

– Anatoli Zakharov, Fire Station 2 Chernobyl

Disaster
Recovery point
Recovery time

Data loss Down time

And yet…

Fukushima Daiichi
11 March 2011

Shared failures: Chernobyl and Fukushima Daiichi

• Unplanned event causes coolant failure

• Uncontrolled fuel rod triggers meltdown event

• Uncontrolled release of steam triggers explosion

• Generator present, but failure occurs

Lesson learned?

Failure is not one thing.
It’s many.

What are we planning for?

Why do I care about disaster recovery if I am in the cloud?

“Everything fails, all the time”

- Werner Vogels (CTO, Amazon.com)

• Over 1 million active customers per month across 190 countries

• 2,300 government agencies

• 7,000 educational institutions

Services can be deployed at Global – Regional – Availability Zone levels of reliability

18 worldwide regions, 55 Availability Zones.

4 new regions, 12 additional Availability Zones announced for 2019

Why AWS for Disaster Recovery?

Your operational DNA has to be crafted for reliability.

• Service SLAs between 99.9% and 100% availability

• Amazon S3 is designed for 99.999999999% durability

• AWS Availability Zones exist on isolated fault lines, flood plains, networks, and electrical grids to substantially reduce the chance of simultaneous failure.

Do not wait for disaster.

It’s not all or nothing.

Start somewhere and scale up

Disaster Recovery in the Cloud

Concepts & Terminology

Start Here: Business Continuity Requirements

▪ How quickly does this service need to be recovered?
  ▪ 1 minute? 15 minutes? 1 hour? 4 hours? 1 day?

▪ How much data loss can be tolerated?
  ▪ Zero data loss? 15 minutes out of date?

Recovery Point Objective (RPO): transactions lost
Recovery Time Objective (RTO): down time
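
To make the two objectives concrete, here is a minimal Python sketch that measures the data loss window and the down time of one incident and compares them with an RPO and an RTO. The timeline and the 15 minute / 1 hour targets are hypothetical values chosen for illustration; they are not from the talk.

```python
from datetime import datetime, timedelta


def measure_recovery(last_recovery_point, disaster, service_restored):
    """Return (data_loss_window, downtime) for a single incident."""
    data_loss_window = disaster - last_recovery_point  # compared against the RPO
    downtime = service_restored - disaster             # compared against the RTO
    return data_loss_window, downtime


# Hypothetical incident timeline (illustrative values only).
last_backup = datetime(2018, 10, 1, 2, 0)
disaster = datetime(2018, 10, 1, 2, 10)
restored = datetime(2018, 10, 1, 3, 40)

# Hypothetical business targets.
rpo = timedelta(minutes=15)
rto = timedelta(hours=1)

data_loss, downtime = measure_recovery(last_backup, disaster, restored)
print("data loss window:", data_loss, "| meets RPO:", data_loss <= rpo)
print("down time:", downtime, "| meets RTO:", downtime <= rto)
```

In this made-up timeline the backup is recent enough to meet the RPO, but the service takes longer to come back than the RTO allows.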

Ascending levels of DR options

Backup & Restore
Backup of on-premises data to AWS to use in a DR event
RPO: 24 hours | aRTO: 24 hours | Cost: $

Pilot Light
Replicate data and minimal running services into AWS, ready to take over and flare up
RPO: 12 hours | aRTO: 4 hours | Cost: $$

Warm Standby
Replicate data and services into AWS ready to take over
RPO: 1-4 hours | aRTO: 15 min | Cost: $$$

Hot-Site
Replicated and load balanced environments that are both actively taking production traffic
RPO: <15 min | aRTO: 0-5 min | Cost: $$$

Scale: from "business continuity begins" to "uninterrupted business continuity"
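
One way to use these tiers is to pick the first (lowest cost) option whose figures satisfy a service's requirements. The Python sketch below assumes that the first value on each slide row is the RPO and the second the recovery time, and it takes the upper bound of each range. The function name and the sample requirements are hypothetical.

```python
from datetime import timedelta

# (name, RPO, recovery time, cost marker), in ascending cost order.
# Assumption: ranges from the slide are represented by their upper bound.
DR_OPTIONS = [
    ("Backup & Restore", timedelta(hours=24), timedelta(hours=24), "$"),
    ("Pilot Light", timedelta(hours=12), timedelta(hours=4), "$$"),
    ("Warm Standby", timedelta(hours=4), timedelta(minutes=15), "$$$"),
    ("Hot-Site", timedelta(minutes=15), timedelta(minutes=5), "$$$"),
]


def cheapest_option(required_rpo, required_rto):
    """Return the first option meeting both requirements, or None."""
    for name, rpo, rto, _cost in DR_OPTIONS:
        if rpo <= required_rpo and rto <= required_rto:
            return name
    return None


# Hypothetical requirement: lose at most 6 hours of data, be back within 1 hour.
print(cheapest_option(timedelta(hours=6), timedelta(hours=1)))
```

A result of `None` means that no tier in the list meets the requirement as stated, which is a prompt to revisit either the requirement or the architecture.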

DR Terminology Map

Your Data Centers           AWS
DNS                         Route 53
Load Balancers              ELB/Appliance
Web/App Servers             EC2/Auto Scaling
Database Servers            Amazon RDS
Firewall                    Security Groups / ACL
Data Centers                Availability Zones / VPC
Geographical Redundancy     Multi-region

Disaster Recovery Approaches

Backup and Restore

Backup and Restore Architecture

[Diagram: Internet traffic for www.example.com reaches the on-premises active production site in the corporate data center (web servers, app servers, DB server, backup system, Storage Gateway over iSCSI). A VPN connection links it to the AWS region, which holds an S3 bucket and Glacier / archive. For AWS DR failover, a 1 TB data volume is restored from the S3 bucket.]

Backup and Restore Use-Case

• Suitable for
  • Solutions that can sustain higher technical debt
  • Lower business critical nature

• Low cost DR option

• Leverage existing investments in
  • De-duplication
  • Compression
  • WAN Acceleration

Pilot light

Pilot light – Preparation

[Diagram: www.example.com is served from the corporate data center (reverse proxy / caching server, application server, master database server). The pilot light system in AWS holds a secondary database server and a data volume kept current by data mirroring / replication; the application server is not running.]

Pilot light – Recovery

[Diagram: www.example.com is served from AWS (application server, database server, data volume). Start in minutes. Add additional capacity, if needed.]

Pilot Light Use-Cases

Suitable for:

• Meeting lower RTO & RPO requirements

• Services that can tolerate some downtime

• Mid-range cost option for DR

Warm Standby

Warm standby – Preparation

[Diagram: Route 53 resolves www.example.com to the active corporate data center (application server, master database server). In the AWS region, a scaled down standby (Elastic Load Balancer, application server, subordinate database server, data volume) is kept current by mirroring / replication but is not active for production traffic. An application data source cut over is shown between the two sites.]

Warm standby – Recovery

[Diagram: Route 53 resolves www.example.com to the Elastic Load Balancer in the AWS region, now active, in front of scaled-up production (application server, database server, data volume).]

Warm Standby Use-Cases

Suitable for:

• Full failover if needed during a disaster

• Solutions that require RTO & RPO in minutes

• Core business-critical functions

Higher cost than pilot light

Hot Site

Hot site – Preparation

[Diagram: Route 53 resolves www.example.com to two active sites: the corporate data center (application server, master database server) and the AWS region (Elastic Load Balancer, application server, subordinate database server, data volume).]

Hot site – Recovery

[Diagram: Route 53 resolves www.example.com to the AWS region, which is active and scaled up for production use (application server, database server, data volume).]

Hot Site Use-Cases

Suitable for:

• Most important business-critical functions

• Applications that cannot afford any downtime

• RTO and RPO in seconds

Highest cost option of Disaster Recovery

What about my data?

Use case 1: Basic backup and recovery

AWS CLI-based backup, manual DR failover

• $ aws s3 sync /backups s3://mybucket

// Back up and sync the backup folder to the bucket

• $ aws s3 sync /backups s3://mybucket --delete

// Like the preceding, but also delete objects from the bucket that are no longer present in /backups

• $ aws s3 sync /backups s3://mybucket --delete --storage-class STANDARD_IA

// Like the preceding, but store the objects in S3 Infrequent Access

What does it look like?

[Diagram: (1) the remote location syncs its backups to the S3 bucket /mybucket in S3 STANDARD_IA; (2) a lifecycle policy moves the objects on to Amazon Glacier.]
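
A lifecycle policy of this kind is expressed as a small JSON document. The Python sketch below builds one that transitions objects to the GLACIER storage class; the 30 day threshold, the rule ID and the empty prefix are assumptions for illustration, since the slide does not state the policy's settings.

```python
import json


def lifecycle_to_glacier(days, prefix=""):
    """Build an S3 lifecycle configuration that archives objects to Glacier.

    `days` and `prefix` are illustrative parameters, not values from the talk.
    """
    return {
        "Rules": [
            {
                "ID": f"archive-after-{days}-days",
                "Filter": {"Prefix": prefix},  # empty prefix applies to the whole bucket
                "Status": "Enabled",
                "Transitions": [{"Days": days, "StorageClass": "GLACIER"}],
            }
        ]
    }


policy = lifecycle_to_glacier(days=30)
print(json.dumps(policy, indent=2))
```

Saved to a file such as policy.json, a document of this shape can be applied with `aws s3api put-bucket-lifecycle-configuration --bucket mybucket --lifecycle-configuration file://policy.json`.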

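What does a recovery look like?

[Diagram: two numbered recovery steps. Backups held in the S3 bucket /mybucket (S3 STANDARD_IA, with a lifecycle policy to Amazon Glacier) are restored either to Amazon EC2 in the AWS DR Region or to a failover remote location.]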

What would it cost?

Data size: 100 GB of images, 500 GB of older data, 1 TB of archives

Storage class       Price
S3                  $0.023/GB
S3 STANDARD_IA      $0.0125/GB
Amazon Glacier      $0.004/GB

Service                                   Cost
S3 - 100 GB images                        $2.30
S3 Infrequent Access - 500 GB of data     $6.25
Amazon Glacier - 1 TB archives            $4.10
Total                                     $12.65/month

Prices shown are for us-east-1 region as of Oct 2018 and subject to change over time.
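
The monthly total follows directly from the per-GB prices. As a quick check, this Python sketch recomputes each line item from the Oct 2018 prices quoted on the slide, assuming 1 TB = 1,024 GB and rounding each item to the cent. The variable names are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP

# Per-GB monthly prices as quoted on the slide (us-east-1, Oct 2018).
PRICE_PER_GB = {
    "S3": Decimal("0.023"),
    "S3 STANDARD_IA": Decimal("0.0125"),
    "Amazon Glacier": Decimal("0.004"),
}
GB_PER_TB = 1024  # assumption: binary terabytes

stored_gb = {"S3": 100, "S3 STANDARD_IA": 500, "Amazon Glacier": 1 * GB_PER_TB}

CENT = Decimal("0.01")
line_items = {
    tier: (gb * PRICE_PER_GB[tier]).quantize(CENT, rounding=ROUND_HALF_UP)
    for tier, gb in stored_gb.items()
}
total = sum(line_items.values())

for tier, cost in line_items.items():
    print(f"{tier}: ${cost}")
print(f"Total: ${total}/month")
```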

Use case 2: Large data archive and recovery

Large data set – Backup using AWS Snowball

[Diagram: three numbered steps. Components: on-premises compute / cluster (NGS) producing sequence data keyed by Flowcell-ID, the AWS CLI, an AWS Snowball device, and AWS Snowball and Amazon Glacier in the AWS cloud.]

Large data set – Backup using Volume Gateway

[Diagram: two numbered steps. Components: on-premises compute / cluster (NGS); AWS Storage Gateway on a virtual server, presenting a cached volume over iSCSI and a virtual tape library; Amazon S3 and Amazon Glacier in the AWS cloud.]

Large data set – Backup using File Gateway

[Diagram: on-premises compute / cluster (NGS) writes over NFS to a File Gateway VM. Components in the cloud: an Amazon S3 bucket with a lifecycle policy in AWS cloud US-West-2 and an Amazon S3 bucket in AWS cloud US-East-1.]

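Large data set – Recovery using AWS Snowball

[Diagram: two numbered recovery paths for sequence data (Flowcell-ID) archived in Amazon Glacier in the AWS DR Region: (1) through AWS Snowball to server infrastructure in a corporate DR facility; (2) through an S3 VPC endpoint to Amazon EC2 in the AWS DR Region.]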

Large data set – Recovery using Volume Gateway

[Diagram: two numbered recovery paths in the AWS DR Region. Components: on-premises compute / cluster (NGS); AWS Storage Gateway on a virtual server with an iSCSI cached volume; Amazon S3 and Amazon Glacier; virtual tape library; EBS snapshot, Amazon EBS, AMI and instances in the AWS DR Region.]

Large data set – Recovery using File Gateway

[Diagram: three numbered steps. Components: an Amazon S3 bucket in the AWS DR Region; on-premises compute / cluster (NGS) with a File Gateway VM serving NFS; Amazon EC2 in the AWS DR Region reaching the bucket through an S3 endpoint.]

What would it cost? – with Gateways

Data size: 10 TB of files, 32 TB of storage volume, 250 TB of tapes

Storage type        Price
File Storage        $0.023/GB
Volume Storage      $0.023/GB
VTL - Archived      $0.004/GB

Service                           Cost
File Gateway - 10 TB              $235.40
Storage Gateway - 32 TB           $736.00
Storage Gateway VTL - 250 TB      $1,000.00
Total                             $1,971.40/month

Prices shown are for us-west-2 region as of Oct 2018 and subject to change over time.

What would it cost? – with Snowball

Data size: 1 PB of data, 1 PB of archives

Service             Price
S3                  $0.023/GB
Snowball Edge       $300/100 TB
Amazon Glacier      $0.004/GB

Service                           Cost
AWS Snowball x 10                 $3,000.00
Amazon Glacier archive - 1 PB     $4,194.31
Total                             $7,194.31 including the Snowball devices; $4,194.31/month for the archive

Prices shown are for us-west-2 region as of Oct 2018 and subject to change over time.
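
The Snowball figures can be reproduced the same way. This Python sketch assumes 1 PB = 1,024 x 1,024 GB, one Snowball device per 100 TB at the quoted $300 each, and rounding up to the next cent, which is what matches the slide's $4,194.31. Treating the Snowball fee as a one-time cost is a reading of the slide's two totals, not something it states outright.

```python
from decimal import Decimal, ROUND_CEILING

SNOWBALL_FEE_PER_DEVICE = Decimal("300")  # quoted as $300/100 TB
GLACIER_PER_GB_MONTH = Decimal("0.004")   # quoted Oct 2018 price
GB_PER_PB = 1024 * 1024                   # assumption: binary units

devices = 10                               # 1 PB at roughly 100 TB per device
CENT = Decimal("0.01")

snowball_cost = devices * SNOWBALL_FEE_PER_DEVICE
# Assumption: round up to the next cent, which reproduces the slide's figure.
glacier_monthly = (GB_PER_PB * GLACIER_PER_GB_MONTH).quantize(CENT, rounding=ROUND_CEILING)
first_month_total = snowball_cost + glacier_monthly

print(f"Snowball devices: ${snowball_cost}")
print(f"Glacier archive: ${glacier_monthly}/month")
print(f"Total including devices: ${first_month_total}")
```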

What if I have even more data?

Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.

- Andrew S. Tanenbaum

Use case 3: Multi site replication and failover

[Diagram: multisite failover and failback. Users reach a customer gateway at Equinix DA1, which connects over AWS Direct Connect and VPN to servers in two Availability Zones in us-east-1, with a further server in us-west-2. A second build of the diagram adds AWS CloudFormation.]

What would it cost? (30 days) - Remote Site

Service                          Price
VPC VPN                          $0.05/hr
EC2* (m4.xlarge)                 $0.20/hr
1 Gb Direct Connect              $0.30/hr
EBS                              $0.10/GB
Region data transfer fee         $0.02/GB

Service                                  Cost
1 Gb Direct Connect                      $219.60
VPN Fallback Connection                  $36.00
(2) EC2 instances – 1 in each AZ         $292.80
(2) EBS 60 GB volumes                    $12.00
(1) AMI copy to us-west-2                $1.20
Total                                    $561.60

*US-West-2, Amazon Linux AMI

Use case 4: All in on AWS – Planning for Amazon S3 data loss

“I’m worried about losing data from S3!”

• Amazon S3 is built for 11 9s of durability
  • If you store 10,000 objects, you can on average expect to incur a loss of a single object once every 10,000,000 years.

• Amazon S3 supports cross region replication

• Amazon S3 supports versioning

• Amazon S3 supports MFA delete

• IAM roles can also be used to limit access to S3
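
The 10,000,000 year figure follows from the durability number. As a worked check, this Python sketch treats 11 nines as the annual durability of a single object and uses exact fractions to avoid floating point noise. The variable names are illustrative.

```python
from fractions import Fraction

# 99.999999999% (11 nines), read as per-object durability over one year.
ANNUAL_DURABILITY = Fraction(99_999_999_999, 100_000_000_000)
objects_stored = 10_000

loss_probability = 1 - ANNUAL_DURABILITY                  # per object, per year
expected_losses_per_year = objects_stored * loss_probability
years_per_lost_object = 1 / expected_losses_per_year

print(years_per_lost_object)
```

Versioning, cross region replication and MFA delete address a different risk: accidental or malicious deletion, which durability alone does not cover.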


Use case 5: All in on AWS – Planning for Database failover

RDS Database

• Create Multi-AZ deployments
  • Data synchronously replicated to another Availability Zone

• Set up automatic backup/snapshots

• Use Cross-region Read Replicas for MySQL, PostgreSQL, MariaDB

• Use Amazon Aurora for MySQL and PostgreSQL
  • Distributed, fault-tolerant, self-healing storage system
  • Low-latency read replicas
  • Point-in-time recovery
  • Continuous backup to Amazon S3
  • Replication across three Availability Zones

Database Migration Service (DMS)

• Continuous or one time DB replication to EC2 or RDS

• Leverage DMS to replicate your database to AWS or even change your schema from one engine to another

Source Database           Target Database on Amazon RDS
Oracle Database           Amazon Aurora, MySQL, PostgreSQL, MariaDB
Oracle Data Warehouse     Amazon Redshift
Microsoft SQL Server      Amazon Aurora, Amazon Redshift, MySQL, PostgreSQL, MariaDB

What about third party support?

Amazon BC/DR partner ecosystem (sample)

• Solutions that utilize AWS to enable recovery strategies

• Focused on RTO and RPO requirements

• Full suite of both cold and warm BC/DR solutions

Disaster Recovery Testing & Assurance

Test continuously and constantly

• Regularly execute tests in stable, production & production-like test environments

• Set up Infrastructure as Code

• CI/CD Test in Infrastructure Build Pipeline

• Playbook to follow documented procedures

Test your DR plan before disaster strikes

Warm Standby – Testing

[Diagram sequence: four builds of the warm standby diagram. Route 53 resolves www.example.com to the active primary site (application server, master database server). The scaled down standby in the AWS region (application server, subordinate database server, data volume) is not active for production traffic.]

aws rds reboot-db-instance --db-instance-identifier dbInstanceID --force-failover

https://github.com/Netflix/chaosmonkey

Unleash the Simian Army!

How easy can I make my DR?

“Alexa, fail over my data center”

#Alexafailover

https://failover.karandemo.com/

Voice Activated Failover with Alexa

[Diagram: Route 53 (failover.karandemo.com) in front of an ELB in the SIN region and an ELB in the SYD region. Singapore Region: web servers (EC2 - ASG), application servers (EC2 - ASG), primary database (Aurora MySQL). Sydney Region: web servers (EC2 - ASG), application servers (EC2 - ASG), database read replica (Aurora MySQL). Control path: Alexa enabled device, Alexa Skill, Lambda function, SNS topic.]

Calls shown in the diagram:

route53.changeResourceRecordSets()

SNS.publish()

rds.failoverDBCluster()
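
The DNS step of such a failover amounts to sending Route 53 a change batch that repoints the record at the standby region's load balancer. The Python sketch below only builds that payload. The CNAME record type, the 60 second TTL and the ELB host name are assumptions for illustration; the demo's actual record configuration is not shown in the slides.

```python
import json


def build_failover_change(record_name, target_dns_name, ttl=60):
    """Build a Route 53 ChangeBatch that repoints a record at the DR site.

    Record type and TTL are illustrative; an alias record or a failover
    routing policy would look different.
    """
    return {
        "Comment": "DR failover",
        "Changes": [
            {
                "Action": "UPSERT",  # create the record or overwrite the existing one
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": ttl,
                    "ResourceRecords": [{"Value": target_dns_name}],
                },
            }
        ],
    }


# Hypothetical target: the DNS name of the load balancer in the Sydney region.
change_batch = build_failover_change("failover.karandemo.com", "syd-elb.example.com")
print(json.dumps(change_batch, indent=2))
```

With boto3, a dictionary of this shape is what gets passed as `ChangeBatch` to the Route 53 client's `change_resource_record_sets` call, together with a `HostedZoneId`. A short TTL keeps the cut over time low, at the price of more DNS queries.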

Putting it all together

Lessons from history

Plan for more than just what you expect to happen


Test your execution plan before you think you can implement it


Knowledge is critical. Know how to interpret an alarm on events.

Words of advice

People generally don’t do well under pressure.

Relying on manual intervention to trigger a DR plan is an invitation for trouble.

Words of advice

• Automate as much as you can

• Table-top exercises can really help you understand roles and responsibility

• Not all services require the same RTO/RPO

• If you don’t have a runbook, it’s time to make one

• If you have one, have you tested it?

Seriously, automate as much as you can ahead of time!

Further Reading:

https://aws.amazon.com/disaster-recovery/

Whitepaper: Using AWS for Disaster Recovery

https://media.amazonwebservices.com/AWS_Disaster_Recovery.pdf


Thank You!

Karan Desai
Solutions Architect, AWS

Twitter: @somecloudguy
