Architecting for Failure: Why are distributed systems so hard?
Markus Eisele
@myfear
Evolution
“Big Iron” (Centralized): Extreme Uptime (99.999), Vertical Scaling, Custom Hardware, Hardware High Availability, Centralized
“Enterprise” (Shared): Designed for availability (99.9), Commodity Hardware, Replicated
“Cloud” (Self Service): Designed for failure (99.999), Horizontal Scaling, Virtualized / Cloud, Software High Availability, Distributed
[Chart: number of enterprise projects over time (60s, 80s, 90s, 2000, 2014, 2016, 2020, 2030) for Mainframe, Enterprise, and Cloud]
Distribution of projects over time. Disclaimer: my personal prediction!
Today’s biggest problem?
High Infrastructure Cost: 11%
Awful Downtime: 9%
Meeting Demand: 21%
Release Frequency: 20%
Developer Velocity: 39%
Meeting demands.
http://www.internetlivestats.com/internet-users/
J2EE
Spring
RoR
Akka
Reactive Manifesto
Microservices
What the hell is “Developer Velocity” anyway?
Release frequency!!
bit.ly/helloworldmsa
And this is why we have Microservices..
Scale, Deploy, Develop Independently
REQ: Building and Scaling Microservices
• Lightweight runtime
• Cross-service security
• Transaction management
• Service scaling
• Load balancing
• SLAs
• Flexible deployment
• Configuration
• Service discovery
• Service versions
• Monitoring
• Governance
• Asynchronous communication
• Non-blocking I/O
• Streaming data
• Polyglot services
• Modularity (service definition)
• High performance persistence (CQRS)
• Event handling / messaging (ES)
• Eventual consistency
• API management
• Health check and recovery
“If the components do not compose cleanly, then all you are doing is shifting complexity from inside a component to the connections between components. Not just does this just move complexity around, it moves it to a place that's less explicit and harder to control.”
Martin Fowler
https://martinfowler.com/articles/microservices.html
How do we handle “failures” in centralized or shared infrastructures?
Why did Application Server become a thing?
• Network and threading
• Two-phase commit (2PC)
• Shared resources
• Manageability
• Clustering supports scalability, performance, and availability.
• Programming models
• Standardization
https://antoniogoncalves.org/2013/07/03/monster-component-in-java-ee-7/
Checked vs. Unchecked Exceptions
“If a client can reasonably be expected to recover from an exception, make it a checked exception. If a client cannot do anything to recover from the exception, make it an unchecked exception.”
https://docs.oracle.com/javase/tutorial/essential/exceptions/runtime.html
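The guideline quoted above can be illustrated with a small sketch. All class and method names here are hypothetical and not taken from the talk; the point is only that an outage is modelled as a checked exception (the caller can degrade), while a blank id is modelled as an unchecked one (a bug the caller cannot recover from).

```java
import java.io.IOException;
import java.util.Optional;

/** Sketch of the checked/unchecked guideline; all names are hypothetical. */
class ExceptionStyles {

    /** Checked: the caller can plausibly recover, e.g. by using a default. */
    static class ItemUnavailableException extends IOException {
        ItemUnavailableException(String message) { super(message); }
    }

    /** Unchecked: a programming error the caller cannot sensibly recover from. */
    static class InvalidItemIdException extends IllegalArgumentException {
        InvalidItemIdException(String message) { super(message); }
    }

    /** Pretend lookup: blank ids are bugs, ids starting with "down-" simulate an outage. */
    static String loadItem(String id) throws ItemUnavailableException {
        if (id == null || id.trim().isEmpty()) {
            throw new InvalidItemIdException("item id must not be blank");
        }
        if (id.startsWith("down-")) {
            throw new ItemUnavailableException("backend unavailable for " + id);
        }
        return "item:" + id;
    }

    /** The compiler forces this caller to decide what recovery looks like. */
    static Optional<String> loadItemOrEmpty(String id) {
        try {
            return Optional.of(loadItem(id));
        } catch (ItemUnavailableException e) {
            return Optional.empty(); // recoverable: degrade to "no item"
        }
    }
}
```

The unchecked exception is deliberately not caught in loadItemOrEmpty; in a classic application it would bubble up to a global handler.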
It wasn’t easy – but manageable.
• MVC handles checked exceptions
• Global exception handlers handle unchecked exceptions
• Centralized log files
“If it ain't broke, don't fix it!” Bert Lance, 1977.
What is different for Microservices?
Microservices are Distributed Systems.
What is Lagom?
• Reactive Microservices Framework for the JVM
• Focused on right-sized services
• Asynchronous I/O and communication as first-class priorities
• Highly productive development environment
• Takes you all the way to production
• https://github.com/lagom/online-auction-java
Protect Yourself with Circuit Breakers
Circuit Breakers
default Descriptor descriptor() {
    return named("item").withCalls(
        pathCall("/api/item", this::createItem),
        restCall(Method.POST, "/api/item/:id/start", this::startAuction),
        pathCall("/api/item/:id", this::getItem),
        restCall(Method.PUT, "/api/item/:id", this::updateItem),
        pathCall("/api/item?userId&status", this::getItemsForUser)
    ).withCircuitBreaker(CircuitBreaker.identifiedBy("item"));
}
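To make the idea behind the descriptor setting more concrete, here is a minimal, single-threaded sketch of the circuit breaker state machine (closed, open, half-open). It is not Lagom's implementation; the class name, the thresholds, and the explicit clock parameter are assumptions chosen to keep the example small and testable.

```java
import java.util.function.Supplier;

/** Minimal, single-threaded circuit breaker sketch (hypothetical, not Lagom's implementation). */
class SimpleCircuitBreaker {
    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int maxFailures;          // consecutive failures before the breaker opens
    private final long resetTimeoutMillis;  // how long to stay open before allowing a trial call
    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int maxFailures, long resetTimeoutMillis) {
        this.maxFailures = maxFailures;
        this.resetTimeoutMillis = resetTimeoutMillis;
    }

    State state() { return state; }

    /** Runs the action through the breaker; the clock is passed in so the sketch stays testable. */
    <T> T call(Supplier<T> action, Supplier<T> fallback, long nowMillis) {
        if (state == State.OPEN) {
            if (nowMillis - openedAt < resetTimeoutMillis) {
                return fallback.get();       // fail fast, do not touch the remote service
            }
            state = State.HALF_OPEN;         // timeout elapsed: allow one trial call
        }
        try {
            T result = action.get();
            failures = 0;
            state = State.CLOSED;            // a success closes the breaker again
            return result;
        } catch (RuntimeException e) {
            failures++;
            if (state == State.HALF_OPEN || failures >= maxFailures) {
                state = State.OPEN;          // trip (or re-trip) the breaker
                openedAt = nowMillis;
            }
            return fallback.get();
        }
    }
}
```

While the breaker is open, callers get the fallback immediately instead of queuing up behind a service that is already struggling.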
Degraded beats Unavailable
Degraded > Unavailable
[Diagram: Search, Bid, and Item services]
CompletionStage<PSequence<Bid>> bidHistoryFuture = bidService.getBids(itemUuid)
    .invoke()
    .exceptionally(error -> {
        // Degrade gracefully: log the failure and show an empty bid history.
        log.warn("Bidding service failed to load", error);
        return TreePVector.empty();
    });
https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/CompletionStage.html#exceptionally-java.util.function.Function-
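The same pattern can be tried without Lagom, using only the JDK. In this sketch the remote call is faked with a hypothetical fetchBids method and the bids are plain strings; only CompletionStage.exceptionally is the real API from the link above.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

/** JDK-only sketch of "degraded beats unavailable"; names are hypothetical. */
class DegradedBidHistory {

    /** Stand-in for a remote call; it fails when `healthy` is false. */
    static CompletionStage<List<String>> fetchBids(boolean healthy) {
        CompletableFuture<List<String>> result = new CompletableFuture<>();
        if (healthy) {
            result.complete(Arrays.asList("bid-1", "bid-2"));
        } else {
            result.completeExceptionally(new IllegalStateException("bidding service down"));
        }
        return result;
    }

    /** If the bidding service fails, the item page still renders, just without a bid history. */
    static CompletionStage<List<String>> bidHistoryOrEmpty(boolean healthy) {
        return fetchBids(healthy)
            .exceptionally(error -> Collections.<String>emptyList());
    }
}
```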
Bulkheading (Kind of Important)
Duplication isn’t a bad thing
Degraded > Unavailable
[Diagram: Search, Bid, and Item services]
Publish/Subscribe
Topic<BidEvent> bidEvents();

default Descriptor descriptor() {
    return named("bidding").withCalls(
        pathCall("/api/item/:id/bids", this::placeBid),
        pathCall("/api/item/:id/bids", this::getBids)
    ).publishing(
        topic("bidding-BidEvent", this::bidEvents)
    );
}
Publish/Subscribe
Topic<BidEvent> bidEventTopic = biddingService.bidEvents();
bidEventTopic.subscribe()
    .atLeastOnce(Flow.<BidEvent>create()
        .map(this::toDocument)
        .mapAsync(1, indexedStore::store));
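At-least-once delivery means the same event can arrive more than once, so the subscriber side is usually written to be idempotent. The following sketch shows one way to do that with an in-memory store keyed by event id. The BidEvent shape and the eventId field are assumptions for illustration; the real classes in the example project may look different.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of an idempotent subscriber for at-least-once delivery; all names are hypothetical. */
class IdempotentIndexer {

    /** Assumed event shape, not the BidEvent from the online-auction example. */
    static final class BidEvent {
        final String eventId;
        final String itemId;
        final int price;

        BidEvent(String eventId, String itemId, int price) {
            this.eventId = eventId;
            this.itemId = itemId;
            this.price = price;
        }
    }

    private final Map<String, BidEvent> indexed = new LinkedHashMap<>();

    /** Stores the event once; a redelivered event with the same id is ignored. */
    boolean store(BidEvent event) {
        return indexed.putIfAbsent(event.eventId, event) == null;
    }

    int size() { return indexed.size(); }
}
```

Processing a redelivered event a second time leaves the index unchanged, which is what makes at-least-once delivery safe here.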
Always have a plan B.
• Fallback pattern (cache instead of DB)
• The cost of resilience should be accuracy or latency.
• CAP theorem: when a network partition happens, you have to choose between availability and consistency. You can't have all three of consistency, availability, and partition tolerance.
What you can do..
https://codahale.com/you-cant-sacrifice-partition-tolerance/
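The fallback pattern ("cache instead of DB") can be sketched as a small wrapper that remembers the last good answer. This is an illustration under assumptions: the class name is hypothetical, the primary lookup is assumed to signal failure with a RuntimeException and to return non-null values, and the cache is never evicted.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

/** Sketch of a "plan B": serve the last known good value when the primary lookup fails. */
class CachedFallback<K, V> {
    private final Function<K, V> primary;                        // e.g. a database lookup (assumed non-null results)
    private final Map<K, V> lastKnownGood = new ConcurrentHashMap<>();

    CachedFallback(Function<K, V> primary) {
        this.primary = primary;
    }

    /** Returns a fresh value if possible, else a possibly stale cached one, else defaultValue. */
    V get(K key, V defaultValue) {
        try {
            V fresh = primary.apply(key);
            lastKnownGood.put(key, fresh);                       // remember it for bad times
            return fresh;
        } catch (RuntimeException e) {
            // Resilience is paid for with accuracy: the cached value may be out of date.
            return lastKnownGood.getOrDefault(key, defaultValue);
        }
    }
}
```

The trade-off matches the bullet above: the caller stays available, but what it sees during an outage may be stale.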
Do you remember?
8 fallacies of distributed computing
1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous
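Fallacies 1 and 2 translate directly into code: a remote call should be bounded by a timeout and have an answer ready for when it fails. The sketch below uses only the JDK. The class and method names are hypothetical, and creating an executor per call is done purely to keep the example short; a real service would share one.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Sketch: never wait forever on the network, and never assume the call succeeds. */
class BoundedCall {

    /** Runs remoteCall with a time budget; returns fallback on timeout or failure. */
    static <T> T callWithTimeout(Callable<T> remoteCall, long timeoutMillis, T fallback) {
        ExecutorService executor = Executors.newSingleThreadExecutor(); // per call, for brevity only
        Future<T> future = executor.submit(remoteCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException | ExecutionException e) {
            future.cancel(true);                 // latency is not zero, the network is not reliable
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();  // preserve the interrupt for the caller
            future.cancel(true);
            return fallback;
        } finally {
            executor.shutdownNow();
        }
    }
}
```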
Lessons learned.
Some things to remember.
• Distributed systems are different because they fail often.
• Writing robust distributed systems costs more than writing robust single-machine systems.
• Robust, open source distributed systems are much less common than robust, single-machine systems.
• Coordination is very hard.
• “It’s slow” is the hardest problem you’ll ever debug.
• Find ways to be partially available.
https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
Where do we go from here?
http://www.ofbizian.com/2016/07/from-fragile-to-antifragile-software.html
Next Steps! Download and try Lagom!
Project Site: http://www.lightbend.com/lagom
GitHub Repo: https://github.com/lagom
Documentation: http://www.lagomframework.com/documentation/1.3.x/java/Home.html
Example: https://github.com/lagom/online-auction-java
Written for architects and developers who must quickly gain a fundamental understanding of microservice-based architectures, this free O’Reilly report explores the journey from SOA to microservices, discusses approaches to dismantling your monolith, and reviews the key tenets of a Reactive microservice:
• Isolate all the Things
• Act Autonomously
• Do One Thing, and Do It Well
• Own Your State, Exclusively
• Embrace Asynchronous Message-Passing
• Stay Mobile, but Addressable
• Collaborate as Systems to Solve Problems
http://bit.ly/ReactiveMicroservice
The detailed example in this report is based on Lagom, a new framework that helps you follow the requirements for building distributed, reactive systems.
• Get an overview of the Reactive Programming model and basic requirements for developing reactive microservices
• Learn how to create base services, expose endpoints, and then connect them with a simple, web-based user interface
• Understand how to deal with persistence, state, and clients
• Use integration technologies to start a successful migration away from legacy systems
http://bit.ly/DevelopReactiveMicroservice