DPC 2016 - 53 Minutes or Less - Architecting For Failure
![Page 1: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/1.jpg)
53 Minutes or Less - Architecting For Failure In The Cloud
Ben Andersen-Waine
![Page 2: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/2.jpg)
53 Minutes?
![Page 3: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/3.jpg)
99.99%
![Page 4: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/4.jpg)
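| Availability (%) | Downtime per year | Downtime per month | Downtime per week |
|---|---|---|---|
| 90 | 36.5 Days | 72 Hours | 16.8 Hours |
| 99 | 3.65 Days | 7.2 Hours | 1.68 Hours |
| 99.9 | 8.76 Hours | 43.8 Min | 10.1 Min |
| 99.99 | 52.56 Min | 4.38 Min | 1.01 Min |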
Adapted From: https://en.wikipedia.org/wiki/High_availability
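The "53 minutes" in the title follows from the 99.99% row: a year has 525,600 minutes, and 0.01% of that is 52.56. A minimal sketch of the arithmetic, assuming a 365-day year and a 7-day week (the function and constant names are illustrative, not from the talk):

```python
# Allowed downtime for a given availability target.
# Assumed period lengths: a 365-day year and a 7-day week.
MINUTES_PER_YEAR = 365 * 24 * 60
MINUTES_PER_WEEK = 7 * 24 * 60


def downtime_minutes(availability_percent: float, period_minutes: float) -> float:
    """Return the minutes of downtime permitted within the period."""
    unavailable_fraction = (100.0 - availability_percent) / 100.0
    return period_minutes * unavailable_fraction


if __name__ == "__main__":
    for target in (90, 99, 99.9, 99.99):
        per_year = downtime_minutes(target, MINUTES_PER_YEAR)
        per_week = downtime_minutes(target, MINUTES_PER_WEEK)
        print(f"{target}%: {per_year:.2f} min/year, {per_week:.2f} min/week")
```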
![Page 5: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/5.jpg)
Architecting For Failure?
![Page 6: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/6.jpg)
Who are you?
1) You have some kind of web application / service
2) You are using an IaaS cloud provider
3) The service needs to be “highly available”
![Page 8: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/8.jpg)
![Page 9: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/9.jpg)
Infrastructure
![Page 10: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/10.jpg)
Infrastructure
• Regions & Availability Zones
• Autoscaling
• Multi Region
![Page 11: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/11.jpg)
Regions And Availability Zones
“Each region is a separate geographic area. Each region has multiple, isolated locations known as Availability Zones. Amazon EC2 provides you the ability to place resources, such as instances, and data in multiple locations. Resources aren't replicated across regions unless you do so specifically.”
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html
![Page 12: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/12.jpg)
http://aws.amazon.com/about-aws/global-infrastructure/
![Page 13: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/13.jpg)
![Page 14: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/14.jpg)
![Page 15: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/15.jpg)
![Page 16: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/16.jpg)
![Page 17: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/17.jpg)
![Page 18: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/18.jpg)
![Page 19: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/19.jpg)
![Page 20: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/20.jpg)
![Page 21: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/21.jpg)
Auto Scaling
“Auto Scaling helps you maintain application availability and allows you to scale your Amazon EC2 capacity up or down automatically according to conditions you define.”
https://aws.amazon.com/autoscaling/
![Page 22: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/22.jpg)
Auto Scaling
• Instance metrics (useful for containers)
• Load balancer health check (useful for web apps on EC2)
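The health-check idea can be sketched as a reconcile step: drop whatever fails the check and launch enough replacements to get back to the desired count. This is a conceptual sketch only, not the AWS API; `Instance`, `reconcile`, and the `healthy` flag are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    # Hypothetical record; "healthy" stands in for a load balancer health check result.
    instance_id: str
    healthy: bool


def reconcile(instances: list[Instance], desired: int) -> tuple[list[str], int]:
    """Return (ids to terminate, number of replacements to launch)."""
    unhealthy = [i.instance_id for i in instances if not i.healthy]
    healthy_count = len(instances) - len(unhealthy)
    # Never launch a negative number when the group is over capacity.
    to_launch = max(desired - healthy_count, 0)
    return unhealthy, to_launch
```

With three instances, one failing its check, and a desired count of three, the step terminates the failing instance and launches one replacement.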
![Page 23: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/23.jpg)
![Page 24: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/24.jpg)
![Page 25: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/25.jpg)
![Page 26: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/26.jpg)
![Page 27: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/27.jpg)
![Page 28: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/28.jpg)
![Page 29: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/29.jpg)
![Page 30: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/30.jpg)
![Page 31: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/31.jpg)
Multi Region
![Page 32: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/32.jpg)
Devops
![Page 33: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/33.jpg)
One day I had this fantasy of starting a certification service for operations. The certification assessment would consist of a colleague and I turning up at the corporate data center and setting about critical production servers with a baseball bat, a chainsaw, and a water pistol. The assessment would be based on how long it would take for the operations team to get all the applications up and running again.
http://martinfowler.com/bliki/PhoenixServer.html
![Page 34: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/34.jpg)
Immutable Infrastructure
![Page 35: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/35.jpg)
Devops
• Environment Creation
• Releasing
• Secret Management
• Service Discovery
![Page 36: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/36.jpg)
Environment Creation
• Vendor's Tool (AWS CloudFormation / GCE Cloud Deployment Manager)
• 3rd Party Solution - Terraform, Ansible
![Page 37: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/37.jpg)
Immutable Infrastructure
http://martinfowler.com/bliki/SnowflakeServer.html
Configuration changes are regularly needed to tweak the environment so that it runs efficiently and communicates properly with other systems. This requires some mix of command-line invocations, jumping between GUI screens, and editing text files.
The result is a unique snowflake - good for a ski resort, bad for a data center.
![Page 38: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/38.jpg)
Releases: Build An Artifact
• Build A VM (AWS AMI / GCE image)
• Use Containers
![Page 39: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/39.jpg)
Releases: Building A VM
![Page 40: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/40.jpg)
Releases: Building A Container
![Page 41: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/41.jpg)
Releases: Canaries
http://martinfowler.com/bliki/CanaryRelease.html
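One way to picture a canary release is a router that sends a small, stable slice of users to the new version and everyone else to the current one. A minimal sketch, assuming users are identified by a string ID; `route` and the percentage parameter are illustrative, and real setups usually do this in the load balancer or DNS:

```python
import hashlib


def route(user_id: str, canary_percent: int) -> str:
    """Send a stable slice of users to the canary release."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100  # bucket in 0..99
    # The same user always lands in the same bucket, so raising the
    # percentage only ever adds users to the canary group.
    return "canary" if bucket < canary_percent else "stable"
```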
![Page 42: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/42.jpg)
![Page 43: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/43.jpg)
Releases: Blue / Green Deploy
https://cloudnative.io/blog/2015/02/the-dos-and-donts-of-bluegreen-deployment/
![Page 44: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/44.jpg)
![Page 45: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/45.jpg)
![Page 46: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/46.jpg)
![Page 47: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/47.jpg)
![Page 48: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/48.jpg)
Service Discovery
https://www.nginx.com/blog/service-discovery-in-a-microservices-architecture/
![Page 49: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/49.jpg)
![Page 50: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/50.jpg)
![Page 51: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/51.jpg)
![Page 52: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/52.jpg)
![Page 53: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/53.jpg)
Service Discovery
• https://github.com/coreos/etcd
• https://www.consul.io/
• https://zookeeper.apache.org/
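The core idea these tools share can be sketched as a registry: instances send heartbeats, and lookups return only instances seen recently. A toy in-memory sketch, not the etcd, Consul, or ZooKeeper API; `Registry`, `heartbeat`, and `lookup` are hypothetical names.

```python
import time


class Registry:
    """Toy in-memory service registry with heartbeat expiry."""

    def __init__(self, ttl_seconds: float, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # (service, address) -> time of last heartbeat

    def heartbeat(self, service: str, address: str) -> None:
        """Register an instance, or refresh it if already known."""
        self._entries[(service, address)] = self._clock()

    def lookup(self, service: str) -> list[str]:
        """Return addresses whose last heartbeat is within the TTL."""
        now = self._clock()
        return sorted(
            address
            for (name, address), seen in self._entries.items()
            if name == service and now - seen <= self._ttl
        )
```

An instance that dies simply stops sending heartbeats and drops out of lookups once its TTL passes, so clients stop being routed to it.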
![Page 54: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/54.jpg)
Secrets
![Page 55: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/55.jpg)
Secrets
• Use secret keeper or vault
• Use environment variables
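The environment-variable option might look like this in application code: read the secret at startup and fail loudly if it is missing, rather than baking it into the artifact. A minimal sketch; `require_secret` and the variable name `DB_PASSWORD` are assumptions for illustration.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent from the environment."""


def require_secret(name: str, environ=None) -> str:
    """Return the secret stored under `name`, or raise if unset or empty."""
    env = os.environ if environ is None else environ
    value = env.get(name)
    if not value:
        # Report only the variable name, never a secret value.
        raise MissingSecretError(f"required secret {name} is not set")
    return value


# Usage (hypothetical variable name):
# db_password = require_secret("DB_PASSWORD")
```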
![Page 56: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/56.jpg)
Secrets
![Page 57: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/57.jpg)
Secrets
![Page 58: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/58.jpg)
Secrets
![Page 59: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/59.jpg)
Secrets
![Page 60: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/60.jpg)
Secrets
• https://www.vaultproject.io/
• https://square.github.io/keywhiz/
![Page 61: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/61.jpg)
Software Development
![Page 62: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/62.jpg)
General Best Practice
• Write tests (preferably first)
• Continuously integrate
• Write documentation
![Page 63: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/63.jpg)
Problem: Services Go Away
![Page 64: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/64.jpg)
Circuit Breaking
http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
![Page 65: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/65.jpg)
Circuit Breaking
http://techblog.netflix.com/2011/12/making-netflix-api-more-resilient.html
![Page 66: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/66.jpg)
Circuit Breaking
Available solutions:
• https://github.com/Netflix/Hystrix
• https://github.com/ejsmont-artur/php-circuit-breaker
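For readers new to the pattern, here is a minimal sketch of a circuit breaker: after a number of consecutive failures it stops calling the remote service and fails fast, then lets a single trial call through once a cool-down has passed. It is not the Hystrix or php-circuit-breaker API; the class, its parameters, and the thresholds are illustrative assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected without reaching the remote service."""


class CircuitBreaker:
    def __init__(self, max_failures: int, reset_after: float, clock=time.monotonic):
        self._max_failures = max_failures
        self._reset_after = reset_after  # cool-down in seconds
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self._reset_after:
                raise CircuitOpenError("circuit open, failing fast")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self._max_failures:
                self._opened_at = self._clock()  # open (or re-open) the circuit
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Callers catch `CircuitOpenError` and serve a fallback (cached data, a default, a friendly error) instead of waiting on a service that is known to be down.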
![Page 67: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/67.jpg)
Problem: Spiky Workloads
![Page 68: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/68.jpg)
Queue Based Load Levelling
https://msdn.microsoft.com/en-gb/library/dn589783.aspx
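The effect of load levelling can be shown with a small simulation: requests arrive in bursts and go into a queue, while the service drains the queue at a fixed rate it can sustain. A sketch under assumed numbers; `level_load` and the tick model are illustrative, and a real system would use a message broker instead of an in-process queue.

```python
from collections import deque


def level_load(arrivals_per_tick: list[int], capacity_per_tick: int) -> tuple[list[int], int]:
    """Return (requests processed in each tick, requests still queued at the end)."""
    queue = deque()
    processed = []
    next_id = 0
    for arrivals in arrivals_per_tick:
        for _ in range(arrivals):  # producers enqueue without waiting
            queue.append(next_id)
            next_id += 1
        handled = 0
        while queue and handled < capacity_per_tick:  # service drains at its own pace
            queue.popleft()
            handled += 1
        processed.append(handled)
    return processed, len(queue)
```

A burst of 10 requests against a capacity of 4 per tick is spread over three ticks; the service never sees more than 4 at once.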
![Page 69: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/69.jpg)
Priority Queue
https://msdn.microsoft.com/en-gb/library/dn589794.aspx
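A minimal in-process sketch of the priority queue pattern using Python's `heapq`: urgent messages are taken first, and messages of equal priority keep their arrival order. The class and message names are hypothetical; in production the broker would typically provide this.

```python
import heapq
import itertools


class PriorityQueue:
    """Lower number = higher priority; ties keep arrival order."""

    def __init__(self):
        self._heap = []
        self._sequence = itertools.count()  # tie-breaker preserving arrival order

    def put(self, priority: int, message: str) -> None:
        heapq.heappush(self._heap, (priority, next(self._sequence), message))

    def get(self) -> str:
        _priority, _order, message = heapq.heappop(self._heap)
        return message

    def __len__(self) -> int:
        return len(self._heap)
```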
![Page 70: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/70.jpg)
Competing Consumers
https://msdn.microsoft.com/en-gb/library/dn568101.aspx
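Competing consumers can be sketched with threads sharing one queue: each message is taken by exactly one consumer, and adding consumers increases throughput. This is an in-process illustration with hypothetical names; a real deployment would have separate worker instances reading from a message broker.

```python
import queue
import threading


def run_consumers(messages: list[str], consumer_count: int) -> dict[str, list[str]]:
    """Process every message once, shared between several consumers."""
    work = queue.Queue()
    for message in messages:
        work.put(message)

    # Each consumer appends only to its own list, so no extra locking is needed.
    handled = {f"consumer-{n}": [] for n in range(consumer_count)}

    def consume(name: str) -> None:
        while True:
            try:
                message = work.get_nowait()
            except queue.Empty:
                return  # queue drained, consumer stops
            handled[name].append(message.upper())  # stand-in for real work

    threads = [threading.Thread(target=consume, args=(name,)) for name in handled]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return handled
```

If one consumer dies, the others keep draining the queue, which is the availability benefit the pattern is after.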
![Page 71: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/71.jpg)
Monitoring / SLAs
![Page 72: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/72.jpg)
SLA - Service Level Agreement
http://www.nkarten.com/handbook.pdf
![Page 73: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/73.jpg)
Monitoring
![Page 74: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/74.jpg)
Obligatory Meme
![Page 75: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/75.jpg)
![Page 76: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/76.jpg)
The Simian Army
http://techblog.netflix.com/2011/07/netflix-simian-army.html
https://github.com/Netflix/SimianArmy/
![Page 77: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/77.jpg)
Final Thoughts
![Page 78: DPC 2016 - 53 Minutes or Less - Architecting For Failure](https://reader031.vdocuments.site/reader031/viewer/2022022412/58f297b91a28ab06548b4567/html5/thumbnails/78.jpg)
Questions