#ITDEVCONNECTIONS | ITDEVCONNECTIONS.COM
DESIGNING FOR FAILURE
Disaster Recovery using AWS
Karan Desai
Solutions Architect, AWS
speaker:~ $ whoami
> Solutions Architect at AWS since 2016
> Previously Akamai
> Previously Ericsson
> MS EE Virginia Tech
> San Francisco Bay Area resident
> Likes the cloud, airplanes, photography
What to Expect from the Session
• Disaster Recovery Concepts & Terminology
• Why AWS for Disaster Recovery?
• DR Design Options
• Data Backup and Restore Strategies
• DR Testing & Assurance
• One More Thing…
A long time ago in a galaxy far, far away…
1986-04-26 01:23:04
Begin experiment...
Recovery point
Panic
Systems are not normal
Manually interpret signals and intervene
Disaster | Recovery point
Data loss
“There must be an incredible amount of radiation here. We'll be lucky if we're all still alive in the morning.”
– Anatoli Zakharov, Fire Station 2 Chernobyl
Disaster | Recovery point | Recovery time
Data loss | Down time
And yet…
Fukushima Daiichi
11 March 2011
• Unplanned event causes coolant failure
• Uncontrolled fuel rod triggers meltdown event
• Uncontrolled release of steam triggers explosion
• Generator present, but failure occurs
Chernobyl Fukushima Daiichi
Shared failures
Lesson learned?
Failure is not one thing. It’s many.
What are we planning for?
Why do I care about disaster recovery if I am in the cloud?
“Everything fails, all the time”
- Werner Vogels (CTO, Amazon.com)
• Over 1 million active customers per month across 190 countries
• 2,300 government agencies
• 7,000 educational institutions
Services can be deployed at Global – Regional – Availability Zone levels of reliability
18 worldwide regions, 55 Availability Zones.
4 new regions, 12 additional Availability Zones announced for 2019
Why AWS for Disaster Recovery?
Your operational DNA has to be crafted for reliability.
• Service SLAs between 99.9% and 100% availability
• Amazon S3 is designed for 99.999999999% durability
• AWS Availability Zones exist on isolated fault lines, flood plains, networks, and electrical grids to substantially reduce the chance of simultaneous failure.
Do not wait for disaster.
It’s not all or nothing.
Start somewhere and scale up
Disaster Recovery in the Cloud
Concepts & Terminology
Start Here: Business Continuity Requirements
▪ How quickly do I need this service to be recovered?
▪ 1 minute? 15 minutes? 1 hour? 4 hours? 1 day?
▪ How much data loss can be tolerated?
▪ Zero data loss? 15 minutes out of date?
Recovery Point Objective (RPO): transactions lost
Recovery Time Objective (RTO): down time
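To make the two objectives concrete, here is a minimal Python sketch that measures the data-loss window and the down time for a single incident and checks them against targets. The timestamps, the 15-minute and 1-hour targets, and the function names are all made-up illustration values, not figures from the talk.

```python
from datetime import datetime, timedelta


def achieved_rpo(last_good_backup: datetime, disaster: datetime) -> timedelta:
    """Data-loss window: everything written after the last recoverable copy."""
    return disaster - last_good_backup


def achieved_rto(disaster: datetime, service_restored: datetime) -> timedelta:
    """Down time: how long the service was unavailable."""
    return service_restored - disaster


def meets_objectives(rpo: timedelta, rto: timedelta,
                     rpo_target: timedelta, rto_target: timedelta) -> bool:
    """True only if both the data-loss window and the down time are within target."""
    return rpo <= rpo_target and rto <= rto_target


# Hypothetical incident timeline (illustration values only).
last_backup = datetime(2018, 10, 1, 1, 0)
disaster = datetime(2018, 10, 1, 1, 10)
restored = datetime(2018, 10, 1, 1, 55)

rpo = achieved_rpo(last_backup, disaster)
rto = achieved_rto(disaster, restored)
ok = meets_objectives(rpo, rto, timedelta(minutes=15), timedelta(hours=1))
print(f"data loss window: {rpo}, down time: {rto}, objectives met: {ok}")
```

Tightening either target (for example an RPO of 5 minutes) makes the same incident fail the check, which is the conversation the business continuity requirements are meant to force.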
Ascending levels of DR options
Backup & Restore
  Backup of on-premises data to AWS to use in a DR event
  RPO: 24 hours | RTO: 24 hours | Cost: $
Pilot Light
  Replicate data and minimal running services into AWS, ready to take over and flare up
  RPO: 12 hours | RTO: 4 hours | Cost: $$
Warm Standby
  Replicate data and services into AWS ready to take over
  RPO: 1-4 hours | RTO: 15 min | Cost: $$$
Hot-Site
  Replicated and load balanced environments that are both actively taking production traffic
  RPO: <15 min | RTO: 0-5 min | Cost: $$$
The scale runs from "Business continuity begins" to "Un-interrupted business continuity".
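One way to use the four options above is to pick the cheapest tier whose RPO and RTO still satisfy the business requirement. The sketch below is a hypothetical helper, not part of the talk: it takes the upper bound of each range on the slide as the tier's worst case, and it assumes cost rises strictly with tier order.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional


@dataclass(frozen=True)
class DrOption:
    name: str
    rpo: timedelta  # worst-case data-loss window (upper bound of the slide's range)
    rto: timedelta  # worst-case down time (upper bound of the slide's range)
    cost_rank: int  # 1 = cheapest; relative order only (assumed from tier order)


OPTIONS = [
    DrOption("Backup & Restore", timedelta(hours=24), timedelta(hours=24), 1),
    DrOption("Pilot Light", timedelta(hours=12), timedelta(hours=4), 2),
    DrOption("Warm Standby", timedelta(hours=4), timedelta(minutes=15), 3),
    DrOption("Hot-Site", timedelta(minutes=15), timedelta(minutes=5), 4),
]


def cheapest_option(required_rpo: timedelta,
                    required_rto: timedelta) -> Optional[DrOption]:
    """Return the lowest-cost tier that meets both requirements, or None."""
    candidates = [o for o in OPTIONS
                  if o.rpo <= required_rpo and o.rto <= required_rto]
    return min(candidates, key=lambda o: o.cost_rank) if candidates else None


choice = cheapest_option(timedelta(hours=12), timedelta(hours=1))
print(choice.name if choice else "no listed option meets this requirement")
```

A requirement tighter than the Hot-Site figures returns `None`, a signal that the requirement itself needs revisiting.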
DR Terminology Map
Your Data Centers          AWS
DNS                        Route 53
Load Balancers             ELB/Appliance
Web/App Servers            EC2/Auto Scaling
Database Servers           Amazon RDS
Firewall                   Security Groups / ACL
Data Centers               Availability Zones / VPC
Geographical Redundancy    Multi-region
Disaster Recovery Approaches
Backup and Restore
[Diagram: Backup and Restore Architecture. On-premises active production for www.example.com in the corporate data center: web servers (Internet traffic), app servers, DB server, backup system, 1 TB data volume. A VPN connection and Storage Gateway (iSCSI) connect to the AWS region (AWS DR failover): S3 bucket, Glacier / Archive.]
Backup and Restore Use-Case
• Suitable for
  • Solutions that can sustain higher technical debt
  • Lower business critical nature
  • Low cost DR option
• Leverage existing investments in
  • De-duplication
  • Compression
  • WAN Acceleration
Pilot light
[Diagram: Pilot light – Preparation (www.example.com). Corporate data center: reverse proxy/caching server, application server, master database server. Pilot light system in AWS: secondary database server and data volume fed by data mirroring/replication; reverse proxy/caching server and application server not running.]
[Diagram: Pilot light – Recovery (www.example.com). Pilot light system: reverse proxy/caching server, application server, database server, data volume. Labels: "Start in minutes", "Add additional capacity, if needed".]
Pilot Light Use-Cases
Suitable for:
• Meeting lower RTO & RPO requirements
• Services that can tolerate some downtime
• Mid-range cost option for DR
Warm Standby
[Diagram: Warm standby – Preparation (www.example.com, Route 53). Production site, active: reverse proxy/caching server, application server, master database server. Scaled-down standby in the AWS region, not active for production traffic: elastic load balancer, reverse proxy/caching server, application server, subordinate database server, data volume. Labels: "Mirroring / replication", "Application data source cut over".]
[Diagram: Warm standby – Recovery (www.example.com, Route 53). The standby in the AWS region is scaled up for production and its elastic load balancer is active: reverse proxy/caching server, application server, database server, data volume.]
Warm Standby Use-Cases
Suitable for:
• Full failover if needed during a disaster
• Solutions that require RTO & RPO in minutes
• Core business-critical functions
Higher cost than pilot light
Hot Site
[Diagram: Hot site – Preparation (www.example.com, Route 53). Both sites are active. Production site: reverse proxy/caching server, application server, master database server. AWS region: elastic load balancer, reverse proxy/caching server, application server, subordinate database server, data volume. Labels: "Mirroring / replication", "Application data source cut over".]
[Diagram: Hot site – Recovery (www.example.com, Route 53). The AWS region stays active and is scaled up for production use: elastic load balancer, reverse proxy/caching server, application server, database server, data volume.]
Hot Site Use-Cases
Suitable for:
• Most important business-critical functions
• Applications that cannot afford any downtime
• RTO and RPO in seconds
Highest cost option of Disaster Recovery
What about my data?
Use case 1: Basic backup and recovery
• $ aws s3 sync /backups s3://mybucket
// Back up and sync the /backups folder to the bucket
• $ aws s3 sync /backups s3://mybucket --delete
// Like the preceding, but also delete objects in the bucket that no longer exist in /backups
• $ aws s3 sync /backups s3://mybucket --delete --storage-class STANDARD_IA
// Like the preceding, but store the objects in S3 Standard-Infrequent Access
AWS CLI-based backup, manual DR failover
What does it look like?
[Diagram: Step 1: the remote location syncs to the S3 bucket /mybucket in the STANDARD_IA storage class. Step 2: a lifecycle policy moves the data to Amazon Glacier.]
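A lifecycle policy of the kind shown in the diagram can be written as a small rule document. The following Python sketch builds one; the 90-day threshold, the rule ID, and the function name are assumptions chosen for illustration, since the slide does not state when objects move to Glacier.

```python
import json


def build_lifecycle_configuration(days_to_glacier: int = 90) -> dict:
    """Build an S3 lifecycle configuration that archives objects to Glacier.

    days_to_glacier is an assumed value, not one given in the talk.
    """
    if days_to_glacier <= 0:
        raise ValueError("days_to_glacier must be positive")
    return {
        "Rules": [
            {
                "ID": "archive-backups-to-glacier",  # hypothetical rule name
                "Filter": {"Prefix": ""},            # apply to every object
                "Status": "Enabled",
                "Transitions": [
                    {"Days": days_to_glacier, "StorageClass": "GLACIER"}
                ],
            }
        ]
    }


config = build_lifecycle_configuration()
print(json.dumps(config, indent=2))
```

The resulting dictionary has the shape expected by `put_bucket_lifecycle_configuration` in boto3 (as the `LifecycleConfiguration` argument) and, saved as JSON, by `aws s3api put-bucket-lifecycle-configuration`.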
What does a recovery look like?
[Diagram: Failover of the remote location. Step 1: data is available from the S3 bucket /mybucket (STANDARD_IA) and, via the lifecycle policy, Amazon Glacier. Step 2: Amazon EC2 in the AWS DR Region.]
What would it cost?
Data size: 100 GB of images, 500 GB of older data, 1 TB of archives

Unit prices:
  S3 STANDARD_IA    $0.0125/GB
  S3                $0.023/GB
  Amazon Glacier    $0.004/GB

Service                                  Cost
S3 - 100 GB images                       $2.30
S3 Infrequent Access - 500 GB of data    $6.25
Amazon Glacier - 1 TB archives           $4.10
Total                                    $12.65/month

Prices shown are for us-east-1 region as of Oct 2018 and subject to change over time.
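The monthly total can be reproduced with a few lines of arithmetic. This Python sketch uses the unit prices and data sizes from the slide; treating 1 TB as 1024 GB is an assumption, chosen because it is consistent with the $4.10 Glacier line.

```python
# Unit prices in USD per GB per month, as quoted on the slide (us-east-1, Oct 2018).
PRICE_PER_GB_MONTH = {
    "S3": 0.023,
    "S3 STANDARD_IA": 0.0125,
    "Amazon Glacier": 0.004,
}

GB_PER_TB = 1024  # assumption: binary terabyte

# Data sizes from the slide, in GB.
workload_gb = {
    "S3": 100,                        # images
    "S3 STANDARD_IA": 500,            # older data
    "Amazon Glacier": 1 * GB_PER_TB,  # archives
}


def monthly_costs(workload: dict, prices: dict) -> dict:
    """Storage cost per service: size in GB times the per-GB monthly price."""
    return {service: gb * prices[service] for service, gb in workload.items()}


costs = monthly_costs(workload_gb, PRICE_PER_GB_MONTH)
total = sum(costs.values())

for service, cost in costs.items():
    print(f"{service:<16} ${cost:,.2f}")
print(f"{'Total':<16} ${total:,.2f}/month")
```

The figure covers storage only; request, retrieval, and data transfer charges are outside the slide's estimate and outside this sketch.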
Use case 2: Large data archive and recovery
Large data set – Backup using AWS Snowball
[Diagram: Corporate data center: NGS, on-premises compute/cluster, sequence data (Flowcell-ID). Data path (numbered steps 1-3 on the slide): AWS CLI, AWS Snowball device, AWS Snowball, Amazon Glacier in the AWS cloud.]
Large data set – Backup using Volume Gateway
[Diagram: Corporate data center: NGS, on-premises compute/cluster, virtual server with AWS Storage Gateway. Numbered paths 1-2 on the slide: iSCSI cached volume and virtual tape library. AWS cloud: AWS Storage Gateway, Amazon S3, Amazon Glacier.]
Large data set – Backup using File Gateway
[Diagram: Corporate data center: NGS, on-premises compute/cluster, File Gateway VM (NFS). AWS cloud US-West-2: Amazon S3 bucket with lifecycle policy. AWS cloud US-East-1: Amazon S3 bucket.]
Large data set – Recovery using AWS Snowball
[Diagram: Option 1: AWS Snowball ships sequence data (Flowcell-ID) from Amazon Glacier in the AWS DR Region to server infrastructure at the corporate DR facility. Option 2: Amazon EC2 in the AWS DR Region reads the data through an S3 VPC endpoint.]
Large data set – Recovery using Volume Gateway
[Diagram: Option 1: the corporate data center (NGS, on-premises compute/cluster, virtual server with AWS Storage Gateway) restores from Amazon S3 and Amazon Glacier over the iSCSI cached volume and virtual tape library. Option 2: within the AWS DR Region, an instance is built from an EBS snapshot, AMI, and Amazon EBS.]
Large data set – Recovery using File Gateway
[Diagram: Numbered steps 1-3 on the slide. AWS DR Region: Amazon S3 bucket, S3 endpoint, Amazon EC2. Corporate data center: NGS, on-premises compute/cluster, File Gateway VM (NFS).]
What would it cost? – with Gateways
Data size: 10 TB of files, 32 TB of storage volume, 250 TB of tapes

Unit prices:
  File Storage      $0.023/GB
  Volume Storage    $0.023/GB
  VTL - Archived    $0.004/GB

Service                         Cost
File Gateway - 10 TB            $235.40
Storage Gateway - 32 TB         $736.00
Storage Gateway VTL - 250 TB    $1,000.00
Total                           $1,971.40/month

Prices shown are for us-west-2 region as of Oct 2018 and subject to change over time.
What would it cost? – with Snowball
Data size: 1 PB of data, 1 PB of archives

Unit prices:
  S3                $0.023/GB
  Snowball Edge     $300/100 TB
  Amazon Glacier    $0.004/GB

Service                          Cost
AWS Snowball x 10                $3,000.00
Amazon Glacier archive - 1 PB    $4,194.31
Total                            $7,194.31 including the Snowball fee; $4,194.31/month for storage

Prices shown are for us-west-2 region as of Oct 2018 and subject to change over time.
What if I have even more data?
Never underestimate the bandwidth of a station wagon full of tapes hurtling down the highway.
- Andrew S. Tanenbaum
Use case 3: Multi-site replication and failover
[Diagram: Users reach the corporate data center, which connects through a customer gateway to AWS over AWS Direct Connect (Equinix DA1) and VPN. us-east-1: servers in two Availability Zones. us-west-2: server for multisite failover, with a failback path. A second version of the slide adds AWS CloudFormation.]
What would it cost? (30 days) - Remote Site

Unit prices:
  VPC VPN                     $0.05/hr
  EC2 (m4.xlarge)*            $0.20/hr
  1 Gb Direct Connect         $0.30/hr
  EBS                         $0.10/GB
  Region data transfer fee    $0.02/GB

Service                               Cost
1 Gb Direct Connect                   $219.60
VPN Fallback Connection               $36.00
(2) EC2 instances - 1 in each AZ      $292.80
(2) EBS 60 GB volumes                 $12.00
(1) AMI copy to us-west-2             $1.20
Total                                 $561.60

* US-West-2, Amazon Linux AMI
Use case 4: All in on AWS – Planning for Amazon S3 data loss
“I’m worried about losing data from S3!”
• Amazon S3 is built for 11 9s of durability
  • If you store 10,000 objects, you can on average expect to incur a loss of a single object once every 10,000,000 years.
• Amazon S3 supports cross region replication
• Amazon S3 supports versioning
• Amazon S3 supports MFA delete
• IAM roles can also be used to limit access to S3
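The "once every 10,000,000 years" figure follows directly from the durability number. A minimal sketch of the arithmetic, assuming the 11 9s are an annual per-object figure and that object losses are independent (the function name is hypothetical):

```python
from fractions import Fraction

# 99.999999999% durability leaves a 1-in-10^11 chance of losing
# a given object in a given year (assumed annual, per-object basis).
ANNUAL_LOSS_PROBABILITY = Fraction(1, 10**11)


def years_per_expected_loss(object_count: int) -> Fraction:
    """Average number of years between single-object losses."""
    expected_losses_per_year = object_count * ANNUAL_LOSS_PROBABILITY
    return 1 / expected_losses_per_year


print(f"{int(years_per_expected_loss(10_000)):,} years")
```

Exact fractions are used instead of floats so the result is not blurred by rounding at eleven decimal places. Durability only covers loss by the service itself; accidental deletes and overwrites are what versioning, MFA delete, and replication address.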
Use case 5: All in on AWS – Planning for Database failover
RDS Database
• Create Multi-AZ deployments
  • Data synchronously replicated to another Availability Zone
• Set up automatic backup/snapshots
• Use Cross-region Read Replicas for MySQL, PostgreSQL, MariaDB
• Use Amazon Aurora for MySQL and PostgreSQL
  • Distributed, fault-tolerant, self-healing storage system
  • Low-latency read replicas
  • Point-in-time recovery
  • Continuous backup to Amazon S3
  • Replication across three Availability Zones
Database Migration Service (DMS)
• Continuous or one time DB replication to EC2 or RDS
• Leverage DMS to replicate your database to AWS or even change your schema from one engine to another

Source Database          Target Database on Amazon RDS
Oracle Database          Amazon Aurora, MySQL, PostgreSQL, MariaDB
Oracle Data Warehouse    Amazon Redshift
Microsoft SQL Server     Amazon Aurora, Amazon Redshift, MySQL, PostgreSQL, MariaDB
What about third party support?
Amazon BC/DR partner ecosystem (sample)
• Solutions that utilize AWS to enable recovery strategies
• Focused on RTO and RPO requirements
• Full suite of both cold and warm BC/DR solutions
Disaster Recovery Testing & Assurance
Test continuously and constantly
• Regularly execute tests in stable, production & production-like test environments
• Set up Infrastructure as Code
• CI/CD Test in Infrastructure Build Pipeline
• Playbook to follow documented procedures
Test your DR plan before disaster strikes
Warm Standby – Testing
[Diagram: www.example.com via Route 53. Production site, active: reverse proxy/caching server, application server, master database server. Scaled-down standby in the AWS region, not active for production traffic: elastic load balancer, reverse proxy/caching server, application server, subordinate database server, data volume. Labels: "Mirroring / replication", "Application data source cut over".]
aws rds reboot-db-instance --db-instance-identifier dbInstanceID --force-failover
https://github.com/Netflix/chaosmonkey
Unleash the Simian Army!
How easy can I make my DR?
“Alexa, fail over my data center”
#Alexafailover
https://failover.karandemo.com/
Voice Activated Failover with Alexa
[Diagram: failover.karandemo.com is resolved by Route 53. Singapore Region: ELB, web servers (EC2 - ASG), application servers (EC2 - ASG), primary database (Aurora MySQL). Sydney Region: ELB, web servers (EC2 - ASG), application servers (EC2 - ASG), database read replica (Aurora MySQL). Control path: Alexa-enabled device, Alexa Skill, Lambda function, SNS topic. API calls shown on the slide:]
route53.changeResourceRecordSets()
SNS.publish()
rds.failoverDBCluster()
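The three calls above can be sketched as one failover routine. This is a hypothetical reconstruction, not the speaker's Lambda code: the function name, parameter names, call order, CNAME record type, and 60-second TTL are all assumptions. The method names are the boto3 equivalents of the calls on the slide, and the clients are passed in so the sequence can be dry-run without AWS credentials.

```python
def fail_over(rds, route53, sns, *, cluster_id, hosted_zone_id,
              record_name, standby_dns, topic_arn):
    """Fail the database over, repoint DNS at the standby, then notify."""
    # 1. Ask Aurora to fail over the cluster.
    rds.failover_db_cluster(DBClusterIdentifier=cluster_id)

    # 2. Point the public name at the standby region's load balancer.
    route53.change_resource_record_sets(
        HostedZoneId=hosted_zone_id,
        ChangeBatch={
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": record_name,
                    "Type": "CNAME",
                    "TTL": 60,  # assumed: short TTL so clients move quickly
                    "ResourceRecords": [{"Value": standby_dns}],
                },
            }]
        },
    )

    # 3. Tell the operators what just happened.
    sns.publish(TopicArn=topic_arn,
                Message=f"Failover of {record_name} to {standby_dns} requested")


class RecordingClient:
    """Stand-in for a boto3 client that records calls instead of making them."""

    def __init__(self, log, name):
        self._log = log
        self._name = name

    def __getattr__(self, method):
        def call(**kwargs):
            self._log.append((f"{self._name}.{method}", kwargs))
            return {}
        return call


# Dry run with placeholder identifiers (all values are made up).
calls = []
fail_over(
    RecordingClient(calls, "rds"),
    RecordingClient(calls, "route53"),
    RecordingClient(calls, "sns"),
    cluster_id="example-cluster",
    hosted_zone_id="ZEXAMPLE",
    record_name="failover.karandemo.com",
    standby_dns="standby-elb.example.com",
    topic_arn="arn:aws:sns:ap-southeast-2:123456789012:example-topic",
)
print([name for name, _ in calls])
```

In a real Lambda function the three arguments would be `boto3.client("rds")`, `boto3.client("route53")`, and `boto3.client("sns")`. Injecting them keeps the routine testable, which fits the talk's advice to test the DR plan before disaster strikes.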
Putting it all together
Lessons from history
• Plan for more than just what you expect to happen
• Test your execution plan before you think you can implement it
• Knowledge is critical. Know how to interpret an alarm on events.
Words of advice
People generally don’t do well under pressure.
Relying on manual intervention to trigger a DR plan is an invitation for trouble.
• Automate as much as you can
• Table-top exercises can really help you understand roles and responsibilities
• Not all services require the same RTO/RPO
• If you don’t have a runbook, it’s time to make one
• If you have one, have you tested it?
Seriously, automate as much as you can ahead of time!
Further Reading:
https://aws.amazon.com/disaster-recovery/
Whitepaper: Using AWS for Disaster Recovery
https://media.amazonwebservices.com/AWS_Disaster_Recovery.pdf
Thank You!
Karan Desai
Solutions Architect, AWS
Email: [email protected]
Twitter: @somecloudguy