Epidemic Failures
DESCRIPTION
Slides originally written in April 2013 for a private conference and internal use at Netflix. Publishing now since Heartbleed is another example of an epidemic failure mode.
TRANSCRIPT
Cloud Native and Epidemic Failures
April 2014
Adrian Cockcroft
@adrianco @BatteryVentures
http://www.linkedin.com/in/adriancockcroft
Cloud Native?
Epidemic Failures
Automated Diversity
Cloud Native
Construct a highly agile and highly available service from ephemeral and
often broken components
Inspiration
Numquam ponenda est pluralitas sine necessitate
Plurality must never be posited without necessity
Occam’s Razor
Monoculture
Replicate “the best” as patterns
Reduce interaction complexity
Epidemic single point of failure
Pattern Failures
Infrastructure Pattern Failures
Software Stack Pattern Failures
Application Pattern Failures
Infrastructure Pattern Failures
• Device failures – bad batch of disks, PSUs, etc.
• CPU failures – cache corruption, math errors
• Datacenter failures – power, network, disaster
• Routing failures – DNS, Internet/ISP path
Software Stack Pattern Failures
• Time bombs – counter wrap, memory leak
• Date bombs – leap year, leap second, epoch
• Expiration – certs timing out
• Trust revocation – Certificate Authority fails
• Security exploit – e.g. Heartbleed
• Language bugs – compile time
• Runtime bugs – JVM, Linux, Hypervisor
• Network bugs – routers, firewalls, protocols
Application Pattern Failures
• Time bombs – counter wrap, memory leak
• Date bombs – leap year, leap second, epoch
• Content bombs – data-dependent failure
• Configuration – wrong/bad syntax
• Versioning – incompatible mixes
• Cascading failures – error handling bugs etc.
• Cascading overload – excessive logging etc.
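To make the "time bomb" pattern concrete, here is a minimal hypothetical sketch (names and values invented, not from the slides): a counter stored in a fixed-width 32-bit field works for years, then wraps on every replica at the same request count, which is exactly what makes it an epidemic rather than an isolated failure.

```python
def bump(counter: int) -> int:
    """Increment a counter stored as an unsigned 32-bit value."""
    return (counter + 1) & 0xFFFFFFFF  # wraps silently at 2**32

c = 0xFFFFFFFF   # counter state after ~4.29 billion events
c = bump(c)
print(c)         # wraps to 0 -- and every identical replica wraps at once
```

Because every node runs the same code with roughly the same traffic, the wrap is correlated across the whole monoculture; a diversified fleet would at least fail at different times.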
What to do?
Automated diversity management
Diversified automation
Efficient vs. Antifragile
Specific Ideas
• Automate running a mixture
– Diversity as default for any service stack
– No developer overhead, stay agile, low cost
• Support oldest and newest versions together
– Automate running 50/50 mix CentOS/Ubuntu
– Mix versions of JDK, Tomcat, etc.
• Vendor diversity
– Multiple DNS vendors, cloud regions, costs more
– Multiple cloud vendors? Much higher cost.
Generate Permutations

> epi <- data.frame(java=gl(2,1,8,c("java6","java7")),
+                   linux=gl(2,2,8,c("centos","ubuntu")),
+                   codeversion=gl(2,4,8,c("v34","v35")))
> epi
   java  linux codeversion
1 java6 centos         v34
2 java7 centos         v34
3 java6 ubuntu         v34
4 java7 ubuntu         v34
5 java6 centos         v35
6 java7 centos         v35
7 java6 ubuntu         v35
8 java7 ubuntu         v35
Deployment
• Builds
– Manual to test, automate if it works
– Modify build to generate permutation AMIs
– Modify Asgard to auto-deploy permutations
• Data collection
– Tag each instance with its permutation
– Gather metrics by permutation per instance
– Do R-based Design of Experiments analysis
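The data-collection step above can be sketched as follows (a hypothetical illustration, not the Netflix tooling: instance IDs, tag names, and metric values are invented). Each instance carries its permutation as tags, and a metric is rolled up per permutation key for later analysis:

```python
from collections import defaultdict

# Invented per-instance records: permutation tags plus an error count.
instances = [
    {"id": "i-01", "tags": {"java": "java6", "linux": "centos", "codeversion": "v34"}, "errors": 3},
    {"id": "i-02", "tags": {"java": "java7", "linux": "centos", "codeversion": "v34"}, "errors": 1},
    {"id": "i-03", "tags": {"java": "java6", "linux": "ubuntu", "codeversion": "v34"}, "errors": 9},
]

# Aggregate the metric by full permutation key.
by_perm = defaultdict(int)
for inst in instances:
    t = inst["tags"]
    by_perm[(t["java"], t["linux"], t["codeversion"])] += inst["errors"]

for perm, errs in sorted(by_perm.items()):
    print(perm, errs)
```

The per-permutation table produced here is the kind of input the R-based Design of Experiments analysis would consume.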
Analysis
• As a function of permutations
– Error rate
– Response time
– CPU Utilization
• Interactions
– E.g. interaction between linux and java
– Contrasts identify components with issues
– Small changes with high statistical significance
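A minimal sketch of the linux × java interaction idea, with invented error rates: if moving from java6 to java7 changes the error rate differently on centos than on ubuntu, the 2×2 interaction contrast is nonzero and flags the bad combination.

```python
# Hypothetical mean error rate per (linux, java) cell -- numbers invented.
rates = {
    ("centos", "java6"): 0.010,
    ("centos", "java7"): 0.012,
    ("ubuntu", "java6"): 0.011,
    ("ubuntu", "java7"): 0.041,  # this combination stands out
}

# Simple 2x2 interaction contrast:
# (java effect on ubuntu) - (java effect on centos)
interaction = (rates[("ubuntu", "java7")] - rates[("ubuntu", "java6")]) \
            - (rates[("centos", "java7")] - rates[("centos", "java6")])
print(interaction)  # far from zero: java7-on-ubuntu is the problem cell
```

In a real Design of Experiments analysis this contrast would come with a significance test, which is how "small changes with high statistical significance" get detected.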
GCS Total API Outage for ~1hr
Takeaway
Watch out for monocultures
A|B Testing – it’s not just for personalization
http://perfcap.blogspot.com
http://slideshare.net/adrianco – Netflix
http://slideshare.net/adriancockcroft – Battery
http://www.linkedin.com/in/adriancockcroft
@adrianco @BatteryVentures