How Netflix Leverages Multiple Regions to Increase Availability (ARC305) | AWS re:Invent 2013

DESCRIPTION

Netflix sought to increase availability beyond the capabilities of a single region. How is that possible, you ask? We walk you through the journey that Netflix underwent to redesign and operate our service to achieve this lofty goal. Using the principles of isolation and redundancy, our destination is a fully redundant active-active deployment architecture in which end users can be served out of multiple AWS regions. If one region fails, another can quickly take its place. Along the way, we'll explore our Isthmus architecture, a stepping stone toward full active-active deployment in which only the edge services are redundantly deployed to multiple regions. We'll cover real-world challenges we had to overcome, like near-real-time data replication, and the operational tools and best practices we had to develop to make it a success. Discover a whole new breed of monkeys we created to test multiregional resiliency scenarios.

TRANSCRIPT

1. How Netflix Leverages Multiple Regions to Increase Availability: Isthmus and Active-Active Case Study. Ruslan Meshenberg, November 13, 2013. 2013 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.

2. Failure
3. Assumptions. [Quadrant chart: speed of change vs. scale. Slowly changing, large scale: telcos, "everything is broken", hardware will fail. Slowly changing, small scale: enterprise IT, "everything works". Rapid change, large scale: web scale. Rapid change, small scale: startups, software will fail.]
4. Incidents: Impact and Mitigation. PR (public relations / media impact): X incidents; Y incidents mitigated by active-active and game-day practicing. CS (high customer service call volume): XX incidents. Metrics impact (affects A/B test results), feature disable: XXX incidents; YY incidents mitigated by better tools and practices; YYY incidents mitigated by better data tagging. No impact (fast retry or automated failover): XXXX incidents.
5. Does an Instance Fail? It can; plan for it. Bad code or configuration pushes, latent issues, hardware failure. Test with Chaos Monkey.
6. Does a Zone Fail? Rarely, but it has happened before. Routing issues, DC-specific issues, app-specific issues within a zone. Test with Chaos Gorilla.
7. Does a Region Fail? A full-region failure is unlikely and very rare, but individual services can fail region-wide; the most likely cause is a region-wide configuration issue. Test with Chaos Kong.
8. Everything Fails Eventually. The good news is you can do something about it: keep your services running by embracing isolation and redundancy.
9. Cloud Native: A New Engineering Challenge. Construct a highly agile and highly available service from ephemeral and assumed-broken components.
10. Isolation. Changes in one region should not affect others. A regional outage should not affect others. Network partitioning between regions should not affect functionality or operations.
11. Redundancy. Make more than one of pretty much everything; specifically, distribute services across Availability Zones and regions.
12. History: Christmas Eve 2012. Netflix had a multi-hour outage caused by a US-East-1 regional Elastic Load Balancing issue: "...data was deleted by a maintenance process that was inadvertently run against the production ELB state data." The ELB issue affected less than 7% of running ELBs and prevented other ELBs from scaling.
13. Isthmus: Normal Operation. [Diagram: traffic enters ELBs in US-East-1 and US-West-2; a Zuul tier in US-West-2 tunnels requests to the US-East-1 infrastructure, which runs in zones A, B, and C with Cassandra replicas.]
14. Isthmus: Failover. [Diagram: the same topology with traffic redirected when the US-East-1 ELBs fail.]
15. Isthmus. [Diagram.]
16. Zuul Overview. [Diagram: Zuul fronted by Elastic Load Balancing.]
17. Zuul Details. [Diagram.] (An illustrative routing-filter sketch follows slide 24 below.)
18. Denominator: Abstracting the DNS Layer. [Diagram: Denominator drives Amazon Route 53, DynECT DNS, and UltraDNS, above the regional load balancers / ELBs and the zone A, B, and C Cassandra replicas.]
19. Isthmus covers only Elastic Load Balancing failures. Other services may fail region-wide, and it is not worthwhile to develop a one-off solution for each of them.
20. Active-Active: Full Regional Resiliency. [Diagram: regional load balancers in both regions, each region running zones A, B, and C with Cassandra replicas.]
21. Active-Active: Failover. [Diagram: the same topology with one region's traffic redirected to the other.]
22. Active-Active Architecture. [Diagram.]
23. Separating the Data: Eventual Consistency. 24 region Cassandra clusters. Eventual consistency != hopeful consistency.
24. Highly Available NoSQL Storage. A highly scalable, available, and durable deployment pattern based on Apache Cassandra. (A consistency-level sketch follows below.)
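The Isthmus slides (13-17) describe routing traffic that enters through US-West-2 ELBs across a tunnel to the US-East-1 middle tier via Zuul. As a rough illustration of how such a routing rule can be expressed, here is a minimal sketch in the style of a Zuul 1.x routing filter; the backend VIP, the header check, and the isIsthmusTraffic helper are hypothetical and not Netflix's actual implementation.

```java
import java.net.MalformedURLException;
import java.net.URL;

import com.netflix.zuul.ZuulFilter;
import com.netflix.zuul.context.RequestContext;

/**
 * Illustrative Zuul 1.x-style routing filter: requests that arrive at the
 * remote-region edge are forwarded across the "isthmus" to the backend
 * region. Host name and traffic check are hypothetical.
 */
public class IsthmusRoutingFilter extends ZuulFilter {

    private static final String BACKEND_VIP = "https://api.us-east-1.example.internal"; // hypothetical

    @Override
    public String filterType() {
        return "route";   // runs after pre-filters, before post-filters
    }

    @Override
    public int filterOrder() {
        return 10;
    }

    @Override
    public boolean shouldFilter() {
        // Only route requests that belong to the isthmus traffic slice.
        return isIsthmusTraffic(RequestContext.getCurrentContext());
    }

    @Override
    public Object run() {
        RequestContext ctx = RequestContext.getCurrentContext();
        try {
            // Send the request cross-region to the backend VIP.
            ctx.setRouteHost(new URL(BACKEND_VIP));
        } catch (MalformedURLException e) {
            throw new IllegalStateException("Bad backend VIP", e);
        }
        return null;
    }

    private boolean isIsthmusTraffic(RequestContext ctx) {
        // Hypothetical check, e.g. based on a request header or device type.
        return "isthmus".equals(ctx.getRequest().getHeader("X-Traffic-Slice"));
    }
}
```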
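Slides 23-25 rest on Cassandra's tunable consistency: a write is acknowledged by a single replica in the local region (CL.ONE) and reaches the other region asynchronously, which is why eventual consistency has to be engineered and measured rather than hoped for. Below is a minimal sketch of that read/write pattern using the DataStax Java driver (not the client library Netflix used in 2013); the contact point, datacenter name, keyspace, and table are assumptions for illustration.

```java
import java.net.InetSocketAddress;

import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.SimpleStatement;

/**
 * Illustrative multi-region Cassandra access at consistency level ONE:
 * the write is acknowledged by one replica in the local region and is
 * replicated to the remote region asynchronously (eventual consistency).
 */
public class EventualConsistencyExample {

    public static void main(String[] args) {
        try (CqlSession session = CqlSession.builder()
                // Hypothetical contact point and datacenter name.
                .addContactPoint(new InetSocketAddress("cassandra.us-east-1.example.internal", 9042))
                .withLocalDatacenter("us-east-1")
                .build()) {

            // Write in US-East-1: wait for a single replica to acknowledge.
            SimpleStatement write = SimpleStatement
                    .newInstance("INSERT INTO membership.subscriber (id, plan) VALUES (?, ?)",
                            "user-123", "4-screen")
                    .setConsistencyLevel(ConsistencyLevel.ONE);
            session.execute(write);

            // A read issued shortly afterwards from the other region (through a
            // session bound to US-West-2) would also use ONE; it may briefly
            // return stale data until cross-region replication catches up,
            // which is exactly what the slide 25 benchmark validates.
            SimpleStatement read = SimpleStatement
                    .newInstance("SELECT plan FROM membership.subscriber WHERE id = ?", "user-123")
                    .setConsistencyLevel(ConsistencyLevel.ONE);
            session.execute(read);
        }
    }
}
```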
25. Benchmarking Global Cassandra. A write-intensive test of cross-region replication capacity: 16 hi1.4xlarge SSD nodes per zone, 96 nodes total, 192 TB of SSD in six locations, up and running Cassandra in 20 minutes. Test load: 1 million writes at CL.ONE (wait for one replica to ack) in US-East-1 (Virginia). Validation load: 1 million reads 500 ms later at CL.ONE with no data loss in US-West-2 (Oregon). Inter-region traffic reached up to 9 Gbit/s at 83 ms; 18 TB of backups from S3.
26. Propagating EVCache Invalidations. (A drainer-loop sketch appears after the transcript.) (1) Data is set in EVCache in US-EAST-1. (2) The EVCache client writes events to SQS and EVCACHE_REGION_REPLICATION. (3) The replication service's drainer reads from SQS in batches. (4) It calls the writer with key, write time, TTL, and value after checking that this is the latest event for the key in the current batch; the call goes cross-region through ELB over HTTPS. (5) The US-WEST-2 replication service checks the write time to ensure this is the latest operation for the key. (6) It deletes the value for the key in the local EVCache. (7) Keys that were successful are returned. (8) Successful keys are deleted from SQS.
27. Archaius: Region-isolated Configuration. (A configuration sketch appears after the transcript.)
28. Running Isthmus and Active-Active.
29. Multiregional Monkeys. Detect failures to deploy, differences in configuration, and resource differences.
30. Multiregional Monitoring and Alerting. Per-region metrics, global aggregation, anomaly detection.
31. Failover and Fallback. Failover is driven by DNS (Denominator) changes; for fallback, ensure data consistency first. Some challenges: cold caches and autoscaling. Automate, automate, automate.
32. Validating the Whole Thing Works.
33. DevOps in N Regions. Best practices: avoid peak times for deployments, detect problems early and roll back, use automated canaries and continuous delivery.
34. Hyperscale Architecture. [Diagram: DNS automation over Amazon Route 53, DynECT DNS, and UltraDNS; regional load balancers in both regions, each region running zones A, B, and C with Cassandra replicas.]
35. Does It Work?
36. Building Blocks. Available on the Netflix GitHub site.
37. Related sessions:
- What an Enterprise Can Learn from Netflix, a Cloud-native Company (ENT203), Thursday, Nov 14, 4:15 PM - 5:15 PM
- Maximizing Audience Engagement in Media Delivery (MED303), Thursday, Nov 14, 4:15 PM - 5:15 PM
- Scaling your Analytics with Amazon Elastic MapReduce (BDT301), Thursday, Nov 14, 4:15 PM - 5:15 PM
- Automated Media Workflows in the Cloud (MED304), Thursday, Nov 14, 5:30 PM - 6:30 PM
- Deft Data at Netflix: Using Amazon S3 and Amazon Elastic MapReduce for Monitoring at Gigascale (BDT302), Thursday, Nov 14, 5:30 PM - 6:30 PM
- Encryption and Key Management in AWS (SEC304), Friday, Nov 15, 9:00 AM - 10:00 AM
- Your Linux AMI: Optimization and Performance (CPN302), Friday, Nov 15, 11:30 AM - 12:30 PM
38. Takeaways. Embrace isolation and redundancy for availability. NetflixOSS helps everyone become cloud native. http://netflix.github.com, http://techblog.netflix.com, http://slideshare.net/Netflix. @rusmeshenberg, @NetflixOSS
39. We are sincerely eager to hear your feedback on this presentation and on re:Invent. Please fill out an evaluation form when you have a chance.
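Slide 26 outlines the EVCache invalidation pipeline: cache mutations in one region are written as events to SQS, a drainer reads them in batches, keeps only the latest event per key, calls the remote region's writer over HTTPS, and removes from the queue only the events that were applied successfully. The sketch below illustrates the drainer side of that loop with the AWS SDK for Java v1; the queue URL, the message attribute name, and the RemoteRegionWriter interface are hypothetical stand-ins, not the NetflixOSS EVCache replication code.

```java
import java.util.HashMap;
import java.util.Map;

import com.amazonaws.services.sqs.AmazonSQS;
import com.amazonaws.services.sqs.AmazonSQSClientBuilder;
import com.amazonaws.services.sqs.model.Message;
import com.amazonaws.services.sqs.model.ReceiveMessageRequest;

/**
 * Illustrative drainer loop for cross-region cache invalidation, following
 * the flow on slide 26: read replication events from SQS in batches, keep
 * only the newest event per key, forward them to the remote-region writer,
 * and delete from SQS only the events that were applied successfully.
 */
public class EvcacheReplicationDrainer {

    private final AmazonSQS sqs = AmazonSQSClientBuilder.defaultClient();
    private final String queueUrl =
            "https://sqs.us-east-1.amazonaws.com/123456789012/EVCACHE_REGION_REPLICATION"; // hypothetical

    public void drainOnce(RemoteRegionWriter writer) {
        // Step 3: read replication events from SQS in batches.
        ReceiveMessageRequest request = new ReceiveMessageRequest(queueUrl)
                .withMaxNumberOfMessages(10)
                .withWaitTimeSeconds(20)
                .withMessageAttributeNames("cacheKey"); // hypothetical attribute

        // Step 4: keep only the latest event per key within the batch
        // (a real implementation would compare write times, not arrival order).
        Map<String, Message> latestPerKey = new HashMap<>();
        for (Message message : sqs.receiveMessage(request).getMessages()) {
            String key = message.getMessageAttributes().get("cacheKey").getStringValue();
            latestPerKey.merge(key, message, (older, newer) -> newer);
        }

        // Steps 4-7: call the remote region's writer over HTTPS; it verifies the
        // write time, applies the invalidation, and reports the keys it handled.
        for (Map.Entry<String, Message> entry : latestPerKey.entrySet()) {
            boolean applied = writer.replicate(entry.getKey(), entry.getValue().getBody());
            if (applied) {
                // Step 8: delete from SQS only the events that succeeded remotely.
                sqs.deleteMessage(queueUrl, entry.getValue().getReceiptHandle());
            }
        }
    }

    /** Hypothetical client for the remote region's EVCache replication writer. */
    public interface RemoteRegionWriter {
        boolean replicate(String key, String eventPayload);
    }
}
```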
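Slide 27 refers to Archaius for region-isolated configuration, so an operator can change a setting in one region without touching the others. Here is a minimal sketch using Archaius 1 dynamic properties; the property name, the per-region naming convention, and the EC2_REGION environment lookup are assumptions made for illustration.

```java
import com.netflix.config.DynamicPropertyFactory;
import com.netflix.config.DynamicStringProperty;

/**
 * Illustrative region-scoped configuration lookup with Archaius 1.
 * Each region resolves its own copy of the property, so changing the
 * value in us-west-2 does not affect us-east-1.
 */
public class RegionScopedConfig {

    // Hypothetical convention: derive the region from the environment.
    private static final String REGION = System.getenv().getOrDefault("EC2_REGION", "us-east-1");

    // Hypothetical property name: one value per region, e.g. edge.trafficMode.us-west-2.
    private static final DynamicStringProperty TRAFFIC_MODE =
            DynamicPropertyFactory.getInstance().getStringProperty(
                    "edge.trafficMode." + REGION,
                    "active"); // default if the property is not set

    public static boolean isActive() {
        // Dynamic properties pick up configuration changes at runtime without a
        // redeploy, which is what makes regional switches practical to operate.
        return "active".equals(TRAFFIC_MODE.get());
    }
}
```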