architecting for failures in micro services: patterns and lessons learned
TRANSCRIPT
![Page 1: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/1.jpg)
Architecting for failures in micro services:
patterns and lessons learned
Bhakti Mehta
@bhakti_mehta
![Page 2: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/2.jpg)
Introduction
• Platform@Atlassian
• In the past Platform Lead at BlueJeans Network
• Worked at Sun Microsystems/Oracle for 13 years
• Committer to numerous open source projects including GlassFish Application Server
![Page 3: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/3.jpg)
My recent book
![Page 4: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/4.jpg)
Previous book
![Page 5: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/5.jpg)
What you will learn
• Path to micro services
• Challenges at scale
• Lessons learned, tips and practices to prevent cascading failures
• Resilience planning at various stages
• Real world examples
![Page 6: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/6.jpg)
Path to micro services
• Advantages
  – Simplicity
  – Isolation of problems
  – Scale up and scale down
  – Easy deployment
  – Clear separation of concerns
  – Heterogeneity and polyglotism
![Page 7: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/7.jpg)
Sounds great!!
![Page 8: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/8.jpg)
In reality……..
![Page 9: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/9.jpg)
Monoliths to Micro services
![Page 10: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/10.jpg)
Path to micro services
• Disadvantages
  – Not a free lunch!
  – Distributed systems are prone to failures
  – Eventual consistency
  – More effort in terms of deployments and release management
  – Challenges in testing services that evolve independently (regression tests, etc.)
![Page 11: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/11.jpg)
Resilient system
• Processes transactions, even when there are transient impulses or persistent stresses
• Functions even when there are component failures disrupting normal processing
• Accepts that failures will happen
• Designs for crumple zones
![Page 12: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/12.jpg)
Kinds of failures
• Challenges at scale
• Integration point failures
  • Network errors
  • Semantic errors
  • Slow responses
  • Outright hang
  • GC issues
![Page 13: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/13.jpg)
![Page 14: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/14.jpg)
Challenges at scale
![Page 15: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/15.jpg)
Anticipate failures at scale
• Anticipate growth
• Design for the next order of magnitude
• Design for 10x, plan to rewrite for 100x
![Page 16: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/16.jpg)
Architecting for failures
![Page 17: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/17.jpg)
The more you sweat on the field the less you bleed in war!!!
![Page 18: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/18.jpg)
Resiliency planning Stage 1
• When developing code
• Avoiding cascading failures
  • Circuit breaker
  • Timeouts
  • Retry
  • Bulkhead
  • Cache optimizations
• Avoid malicious clients
  • Rate limiting
![Page 19: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/19.jpg)
Resiliency planning Stage 2
• Planning for dealing with failures before deploy
  • Load test
  • A/B test
  • Longevity
![Page 20: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/20.jpg)
Resiliency planning Stage 3
• Watching out for failures after deploy
  • Health check
  • Metrics
![Page 21: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/21.jpg)
![Page 22: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/22.jpg)
Cascading failures
![Page 23: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/23.jpg)
Cascading failures
• Caused by chain reactions
• For example:
  • One node in a load-balanced group fails
  • Others need to pick up its work
  • Eventually performance can degrade
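The chain reaction can be made concrete with a little arithmetic. The sketch below is only an illustration under simplifying assumptions that are not in the talk: load is spread evenly, and any node pushed above its capacity fails in turn. The function names are hypothetical.

```python
def load_after_failures(nodes, utilization, failed):
    """Per-node load once `failed` nodes are gone and their work is spread evenly."""
    survivors = nodes - failed
    if survivors <= 0:
        raise ValueError("no nodes left to carry the load")
    return utilization * nodes / survivors


def cascade(nodes, utilization, capacity=1.0):
    """Fail one node, then keep failing nodes while the survivors are overloaded.

    Returns the number of nodes still standing (0 means a full outage).
    """
    failed = 1
    while failed < nodes and load_after_failures(nodes, utilization, failed) > capacity:
        failed += 1  # assumption: an overloaded node fails as well
    return nodes - failed
```

With four nodes at 60% utilization, losing one leaves the other three at about 80% and the group survives. At 80% utilization the same single failure overloads every survivor in turn, which is the cascade the slide describes.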
![Page 24: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/24.jpg)
Cascading failures with aggregation
![Page 25: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/25.jpg)
Cascading failure with aggregation
![Page 26: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/26.jpg)
Timeouts pattern
![Page 27: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/27.jpg)
Timeouts
• Clients may prefer a response
  • failure
  • success
  • job queued for later
All aggregation requests to microservices should have reasonable timeouts set
![Page 28: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/28.jpg)
Types of Timeouts
• Connection timeout
  • Max time to wait for a connection to be established before reporting an error
• Socket timeout
  • Max time of inactivity between two packets once connection is established
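The two timeout types can be set separately on a plain TCP socket. This is a minimal sketch using Python's standard `socket` module; the function name and the default values are assumptions chosen for illustration.

```python
import socket


def fetch_with_timeouts(host, port, request, connect_timeout=1.0, socket_timeout=2.0):
    """Send `request` and read the reply, bounding both kinds of waiting."""
    # Connection timeout: give up if the connection is not established in time.
    with socket.create_connection((host, port), timeout=connect_timeout) as sock:
        # Socket timeout: give up if the peer stays silent for this long.
        sock.settimeout(socket_timeout)
        sock.sendall(request)
        chunks = []
        while True:
            data = sock.recv(4096)  # raises socket.timeout on inactivity
            if not data:
                break
            chunks.append(data)
        return b"".join(chunks)
```

A caller would catch `socket.timeout` (and `OSError` for connection failures) and turn it into a failure response or a queued job, rather than waiting indefinitely.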
![Page 29: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/29.jpg)
Timeouts pattern
• Timeouts + Retries go together
• Transient failures can be remedied with fast retries
• However, problems in the network can last for a while, so there is a probability of the retries failing too
![Page 30: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/30.jpg)
Retry pattern
• Retry in case of network failures, timeouts or server errors
• Helps with transient network errors such as dropped connections or server failover
![Page 31: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/31.jpg)
Retry pattern
• If one of the services is slow or malfunctioning and other services keep retrying, then the problem becomes worse
• Solution
  • Exponential backoff
  • Circuit breaker pattern
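Exponential backoff can be sketched as follows. The function name, the defaults and the choice of which exceptions count as retryable are assumptions for illustration; the `sleep` parameter exists only so the delays can be observed without waiting.

```python
import time


def retry_with_backoff(operation, max_attempts=4, base_delay=0.1, max_delay=2.0,
                       retry_on=(ConnectionError, TimeoutError), sleep=time.sleep):
    """Call `operation`, doubling the wait after each retryable failure."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except retry_on:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            # 0.1s, 0.2s, 0.4s, ... capped at max_delay
            sleep(min(max_delay, base_delay * (2 ** attempt)))
```

Spacing the attempts out gives a struggling service room to recover instead of adding load to it. In practice a random jitter is often added to the delay so that many clients do not retry in lockstep.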
![Page 32: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/32.jpg)
Circuit breaker pattern
Circuit breaker
A circuit breaker is an electrical device used in an electrical panel that monitors and controls the amount of amperes (amps) being sent through
![Page 33: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/33.jpg)
Circuit breaker pattern
• Safety device
• If a power surge occurs in the electrical wiring, the breaker will trip
• Flips from “On” to “Off” and shuts off electrical power from that breaker
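In software the same idea is usually expressed as a small state machine wrapped around a remote call. The following is a sketch of the commonly described closed / open / half-open form, not code from the talk; the class name, thresholds and the injectable `clock` are assumptions.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the breaker is closed ("On")

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open, failing fast")
            # Reset timeout elapsed: half-open, let one trial call through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            # Trip on too many failures, or re-trip if the trial call failed.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the breaker and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the breaker is open, callers fail immediately instead of piling more requests onto a service that is already in trouble.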
![Page 34: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/34.jpg)
Bulkhead
![Page 35: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/35.jpg)
Bulkhead
• Avoiding chain reactions by isolating failures
• Helps prevent cascading failures
![Page 36: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/36.jpg)
Bulkhead
• An example of a bulkhead could be isolating the database dependencies per service
• Similarly, other infrastructure components can be isolated, such as cache infrastructure
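The slides describe bulkheads at the infrastructure level. The same isolation can also be applied inside a process by giving each dependency its own bounded pool of slots, which is what this sketch shows. It is a different flavor of the pattern from the one on the slide, and the names are hypothetical.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when a dependency's compartment has no free slot."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, operation):
        # Reject immediately instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return operation()
        finally:
            self._slots.release()
```

With one `Bulkhead` per dependency (say, one for the database and one for the cache), a hung database can exhaust only its own slots, and calls to the cache keep working.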
![Page 37: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/37.jpg)
Rate limiting
![Page 38: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/38.jpg)
Rate Limiting
• Restricting the number of requests that can be made by a client
• Client can be identified based on the access token used
• Additionally, clients can be identified based on IP address
![Page 39: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/39.jpg)
Rate Limiting
• With JAX-RS, rate limiting can be implemented as a filter
• This filter can check the access count for a client and, if within the limit, accept the request
• Else respond with a 429 (Too Many Requests) error
• Code at https://github.com/bhakti-mehta/samples/tree/master/ratelimiting
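The check such a filter performs can be sketched independently of any framework. This is not the JAX-RS sample linked above; it is a minimal fixed-window counter written for illustration, and the class name, window scheme and status-code return value are assumptions.

```python
import time


class FixedWindowRateLimiter:
    def __init__(self, limit, window_seconds, clock=time.monotonic):
        self.limit = limit
        self.window_seconds = window_seconds
        self.clock = clock
        self._windows = {}  # client id -> (window start, requests seen)

    def check(self, client_id):
        """Return 200 if the request is within the limit, else 429."""
        now = self.clock()
        start, count = self._windows.get(client_id, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # the previous window has expired
        if count >= self.limit:
            return 429  # Too Many Requests
        self._windows[client_id] = (start, count + 1)
        return 200
```

The `client_id` would be the access token or, failing that, the IP address. A request filter would call `check` before passing the request on and short-circuit with the 429 response otherwise.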
![Page 40: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/40.jpg)
Cache optimizations
• Stores response information related to requests in temporary storage for a specific period of time
• Ensures that the server is not burdened with processing those requests in future when responses can be fulfilled from the cache
![Page 41: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/41.jpg)
Cache optimizations
Getting from first level cache
Getting from second level cache
Getting from the DB
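The lookup order in the diagram (first level cache, then second level cache, then the DB) can be sketched like this. Both levels are plain dictionaries here; in a real deployment the second level would typically be a shared cache, and `load_from_db` stands in for the actual query. All names and the expiry scheme are assumptions.

```python
import time


class TwoLevelCache:
    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self.load_from_db = load_from_db
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self.first_level = {}   # key -> (value, expires_at), local to the process
        self.second_level = {}  # key -> (value, expires_at), stand-in for a shared cache

    def _fresh(self, level, key, now):
        entry = level.get(key)
        if entry is not None and entry[1] > now:
            return entry
        level.pop(key, None)  # missing or expired
        return None

    def get(self, key):
        now = self.clock()
        entry = self._fresh(self.first_level, key, now)
        if entry is None:
            entry = self._fresh(self.second_level, key, now)
            if entry is None:
                # Both caches missed: go to the DB and remember the answer.
                entry = (self.load_from_db(key), now + self.ttl_seconds)
                self.second_level[key] = entry
            self.first_level[key] = entry  # promote for the next lookup
        return entry[0]
```

Entries expire after `ttl_seconds`, which matches the "specific period of time" on the previous slide. Until then the DB is not touched for that key.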
![Page 42: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/42.jpg)
Dealing with latencies in response
• Have a timeout for the aggregation service
• Dispatch requests in parallel and collect responses
• Associate a priority with all the responses collected
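The three points above can be combined in one aggregation helper. This sketch uses Python's standard `concurrent.futures`. The meaning of "priority" is an assumption here (lower numbers are placed first in the merged response), and `aggregate` and `fake_service` are hypothetical names.

```python
import time
from concurrent.futures import ThreadPoolExecutor, wait


def fake_service(value, delay=0.0):
    """Stand-in for a remote call that takes `delay` seconds to answer."""
    def call():
        time.sleep(delay)
        return value
    return call


def aggregate(calls, timeout):
    """calls maps a service name to (priority, zero-argument callable)."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(calls)))
    # Dispatch everything in parallel.
    futures = {pool.submit(fn): (priority, name) for name, (priority, fn) in calls.items()}
    # One overall timeout for the whole aggregation.
    done, _ = wait(futures, timeout=timeout)
    pool.shutdown(wait=False)  # do not block on stragglers
    collected, missing = [], []
    for future, (priority, name) in futures.items():
        if future in done and future.exception() is None:
            collected.append((priority, name, future.result()))
        else:
            missing.append(name)  # too slow or failed
    collected.sort(key=lambda item: item[0])
    return [(name, value) for _, name, value in collected], sorted(missing)
```

The caller gets whatever arrived in time, ordered by priority, plus the names of the services that did not answer, so it can decide how to present a partial response.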
![Page 43: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/43.jpg)
Handling partial failures best practices
• One service calls another which can be slow or unavailable
• Never block indefinitely waiting for the service
• Try to return partial results
• Provide a caching layer and return cached data
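Falling back to cached data can be as small as the sketch below. The function name, the dictionary used as the caching layer and the `stale` flag in the result are assumptions made for illustration.

```python
def get_with_cached_fallback(key, fetch, cache):
    """Try the live service first; on failure, serve the last known value."""
    try:
        value = fetch(key)
    except (ConnectionError, TimeoutError):
        # The service is slow or unavailable: return cached data if we have it,
        # otherwise an explicit empty result rather than an error.
        return {"value": cache.get(key), "stale": True}
    cache[key] = value  # refresh the cache on every successful call
    return {"value": value, "stale": False}
```

The `fetch` callable is expected to enforce its own timeout, so the caller never blocks indefinitely. The `stale` flag lets the client tell a fresh answer from a fallback.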
![Page 44: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/44.jpg)
Logging
• Complex distributed systems introduce many points of failure
• Logging helps link events/transactions between the various components that make up an application or a business service
• ELK stack
• Splunk, syslog
• Loggly
• LogEntries
![Page 45: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/45.jpg)
Logging best practices
• Include a detailed, consistent pattern across service logs
• Obfuscate sensitive data
• Identify caller or initiator as part of logs
• Do not log payloads by default
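These practices map onto a formatter and a filter in most logging libraries. The sketch below uses Python's standard `logging` module. The field names (`caller`, `request_id`), the log pattern and the list of sensitive keys are assumptions, not something prescribed by the talk.

```python
import io
import logging
import re

# Hypothetical list of keys whose values must never appear in logs.
SENSITIVE = re.compile(r"\b(password|token|secret)=\S+")

LOG_PATTERN = ("%(asctime)s %(levelname)s service=%(name)s "
               "caller=%(caller)s request_id=%(request_id)s %(message)s")


class RedactingFilter(logging.Filter):
    def filter(self, record):
        # Obfuscate sensitive values before the record is formatted.
        record.msg = SENSITIVE.sub(r"\1=***", record.getMessage())
        record.args = ()
        # Make sure every line identifies the caller, even if unknown.
        for field in ("caller", "request_id"):
            if not hasattr(record, field):
                setattr(record, field, "unknown")
        return True


def build_logger(service, stream):
    logger = logging.getLogger(service)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_PATTERN))
    handler.addFilter(RedactingFilter())
    logger.addHandler(handler)
    return logger
```

A service would log with `logger.info("login ok token=abc123", extra={"caller": "mobile-app", "request_id": "r-42"})`. Carrying the same `request_id` across services is one way to link a transaction between components in the log store.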
![Page 46: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/46.jpg)
Best practices when designing APIs for mobile clients
• Avoid chattiness
• Use aggregator pattern
![Page 47: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/47.jpg)
Thoughts of the on call person paged at 3 am debugging an issue
![Page 48: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/48.jpg)
Resilience planning Stage 2
• Before deploy
  • Load testing
  • Longevity testing
  • Capacity planning
![Page 49: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/49.jpg)
Load testing
• Ensure that you test for load on APIs
• Plan for longevity testing
![Page 50: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/50.jpg)
Capacity Planning
• Anticipate growth
• Design for handling exponential growth
![Page 51: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/51.jpg)
Resilience planning Stage 3
• After deploy
  • Health check
  • Metrics and Monitoring
  • Phased rollout of features
![Page 52: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/52.jpg)
Health Check
![Page 53: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/53.jpg)
Health Check
• Memory
• CPU
• Threads
• Error rate
• If any of the checks exceeds a threshold, send an alert
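The threshold rule can be sketched as a pure function over the latest readings. The metric names, the units and the `send_alert` callback are hypothetical; collecting the readings themselves is left out.

```python
def evaluate_health(readings, thresholds, send_alert):
    """Compare each reading with its threshold and alert on every breach."""
    breaches = {
        name: value
        for name, value in readings.items()
        if name in thresholds and value > thresholds[name]
    }
    for name, value in sorted(breaches.items()):
        send_alert(f"{name}={value} exceeds threshold {thresholds[name]}")
    return {"status": "alert" if breaches else "ok", "breaches": breaches}
```

A health endpoint could return this dictionary as its body, while `send_alert` forwards the message to whatever paging or chat system is in use.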
![Page 54: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/54.jpg)
Metrics and Monitoring
![Page 55: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/55.jpg)
Metrics
• Response times, throughput
• Identify slow running DB queries
• GC rate and pause duration
  • Garbage collection can cause slow responses
• Monitor unusual activity
![Page 56: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/56.jpg)
Metrics
• Load average
• Uptime
• Log sizes
• Response times
![Page 57: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/57.jpg)
Monitoring
Monitoring server
Production Environment
CHECKS
ALERTS
![Page 58: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/58.jpg)
Rollout of new features
• Phasing rollout of new features
• Have a way to turn features off if not behaving as expected
• Alerts and more alerts!
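A phased rollout with an off switch is often built on a deterministic percentage check. The sketch below is one common way to do it, not a method described in the talk; the function name, the hashing scheme and the `kill_switch` flag are assumptions.

```python
import hashlib


def is_enabled(feature, user_id, rollout_percent, kill_switch=False):
    """Decide whether `user_id` sees `feature` at the current rollout stage."""
    if kill_switch:
        return False  # feature turned off for everyone, regardless of percentage
    # Hash feature and user together so each feature gets its own stable buckets.
    digest = hashlib.sha256(f"{feature}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < rollout_percent
```

Because the bucket for a given user never changes, raising `rollout_percent` from 10 to 50 only adds users, and setting `kill_switch` takes the feature away from all of them at once.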
![Page 59: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/59.jpg)
Real world examples
• Netflix's Simian Army induces failures of services and even datacenters during the working day to test both the application's resilience and monitoring.
• Latency Monkey to simulate slow running requests
• Wiremock to mock services
• Saboteur to create deliberate network mayhem
![Page 60: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/60.jpg)
Takeaway
• Inevitability of failures
• Expect systems will fail
• Failure prevention
• Automate
![Page 61: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/61.jpg)
![Page 62: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/62.jpg)
References
• https://commons.wikimedia.org/wiki/File:Bulkhead_PSF.png
• https://en.wikipedia.org/wiki/Circuit_breaker#/media/File:Four_1_pole_circuit_breakers_fitted_in_a_meter_box.jpg
• http://weknowyourdreams.com/image.php?pic=/images/happiness/happiness-04.jpg
• http://www.fitnessandpower.com/wp-content/uploads/2013/10/military-fitness.jpg
• http://cdn1.tnwcdn.com/wp-content/blogs.dir/1/files/2010/10/speed-limit-change-sign-resized_2.jpg
• https://www.askideas.com/media/51/Funny-Grumpy-Cat-Some-People-Just-Need-A-Hug-Around-The-Neck-With-A-Rope-Image.jpg
• https://www.flickr.com/photos/skynoir/ Beer in hand: skynoir/Flickr/Creative Commons License
![Page 63: Architecting for Failures in micro services: patterns and lessons learned](https://reader038.vdocuments.site/reader038/viewer/2022103010/5887162c1a28abf2228b717f/html5/thumbnails/63.jpg)
Questions
• Twitter: @bhakti_mehta
• Email: [email protected]