architecting for failure - why are distributed systems hard?

Architecting for Failure: Why are distributed systems so hard?

Markus Eisele

@myfear

Evolution

“Big Iron” (Centralized): Extreme Uptime (99.999) • Vertical Scaling • Custom Hardware • Hardware High Availability • Centralized

“Enterprise” (Shared): Designed for availability (99.9) • Commodity Hardware • Replicated

“Cloud” (Self Service): Designed for failure (99.999) • Horizontal Scaling • Virtualized / Cloud • Software High Availability • Distributed

[Chart: Distribution of projects over time. Series: Mainframe, Enterprise, Cloud. Y-axis: Number of Enterprise Projects. X-axis: 60s, 80s, 90s, 2000, 2014, 2016, 2020, 2030. Disclaimer: My personal prediction!]

Today’s biggest problem?

• High Infrastructure Cost: 11%
• Awful Downtime: 9%
• Meeting Demand: 21%
• Release Frequency: 20%
• Developer Velocity: 39%

Meeting demands.

http://www.internetlivestats.com/internet-users/

J2EE

Spring

RoR

Akka

Reactive Manifesto

Microservices

What the hell is “Developer Velocity” anyway?

Release frequency!!

bit.ly/helloworldmsa

And this is why we have Microservices..

Scale • Deploy • Develop Independently

REQ: Building and Scaling Microservices

• Lightweight runtime
• Cross-Service Security
• Transaction Management
• Service Scaling
• Load Balancing
• SLAs
• Flexible Deployment
• Configuration
• Service Discovery
• Service Versions
• Monitoring
• Governance
• Asynchronous communication
• Non-blocking I/O
• Streaming Data
• Polyglot Services
• Modularity (Service definition)
• High performance persistence (CQRS)
• Event handling / messaging (ES)
• Eventual consistency
• API Management
• Health check and recovery

“If the components do not compose cleanly, then all you are doing is shifting complexity from inside a component to the connections between components. Not just does this just move complexity around, it moves it to a place that's less explicit and harder to control.”
Martin Fowler

https://martinfowler.com/articles/microservices.html

How do we handle “failures” in centralized or shared infrastructures?

Why did Application Server become a thing?

• Network and Threading
• Two Phase Commit (2PC)
• Shared resources
• Manageability
• Clustering supports scalability, performance, and availability.
• Programming models
• Standardization

https://antoniogoncalves.org/2013/07/03/monster-component-in-java-ee-7/

Checked vs. Unchecked Exceptions

“If a client can reasonably be expected to recover from an exception, make it a checked exception. If a client cannot do anything to recover from the exception, make it an unchecked exception.”

https://docs.oracle.com/javase/tutorial/essential/exceptions/runtime.html
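The guideline quoted above can be sketched in a few lines of Java. This is only an illustration: `ItemRepository`, `ItemNotFoundException`, and `describe` are hypothetical names, not taken from the talk or from any framework.

```java
import java.util.HashMap;
import java.util.Map;

public class ExceptionGuideline {

    // Checked: a caller can reasonably recover, e.g. by showing a placeholder.
    static class ItemNotFoundException extends Exception {
        ItemNotFoundException(String id) {
            super("No item with id " + id);
        }
    }

    // Hypothetical in-memory repository, for illustration only.
    static class ItemRepository {
        private final Map<String, String> items = new HashMap<>();

        void save(String id, String name) {
            if (id == null || name == null) {
                // Unchecked: a programming error the caller cannot recover from.
                throw new IllegalArgumentException("id and name must not be null");
            }
            items.put(id, name);
        }

        String find(String id) throws ItemNotFoundException {
            String name = items.get(id);
            if (name == null) {
                throw new ItemNotFoundException(id);
            }
            return name;
        }
    }

    // The compiler forces this caller to decide what to do about a missing item.
    static String describe(ItemRepository repo, String id) {
        try {
            return repo.find(id);
        } catch (ItemNotFoundException e) {
            return "unknown item";
        }
    }

    public static void main(String[] args) {
        ItemRepository repo = new ItemRepository();
        repo.save("1", "Antique clock");
        System.out.println(describe(repo, "1"));
        System.out.println(describe(repo, "2"));
    }
}
```

Within a single JVM this split works because the caller always gets an answer: either a value or an exception it can see.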

It wasn’t easy – but manageable.

• MVC handles checked exceptions
• Global exception handlers handle unchecked exceptions
• Centralized log files

“If it ain't broke, don't fix it!”
Bert Lance, 1977

What is different for Microservices?

Microservices are Distributed Systems.

What is Lagom?

• Reactive Microservices Framework for the JVM
• Focused on right-sized services
• Asynchronous I/O and communication as first-class priorities
• Highly productive development environment
• Takes you all the way to production
• https://github.com/lagom/online-auction-java

Protect Yourself with Circuit Breakers

Circuit Breakers

default Descriptor descriptor() {
    // All calls of the "item" service share one circuit breaker named "item".
    return named("item").withCalls(
        pathCall("/api/item", this::createItem),
        restCall(Method.POST, "/api/item/:id/start", this::startAuction),
        pathCall("/api/item/:id", this::getItem),
        restCall(Method.PUT, "/api/item/:id", this::updateItem),
        pathCall("/api/item?userId&status", this::getItemsForUser)
    ).withCircuitBreaker(CircuitBreaker.identifiedBy("item"));
}
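To make the idea behind the descriptor setting concrete, here is a minimal sketch of the circuit breaker state machine in plain Java. It is not Lagom's implementation; `SimpleCircuitBreaker`, its parameters, and the injected clock are assumptions made for illustration, and the half-open state here does not limit the number of concurrent probe calls.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

public class SimpleCircuitBreaker {

    enum State { CLOSED, OPEN, HALF_OPEN }

    private final int maxFailures;
    private final long resetTimeoutMillis;
    private final LongSupplier clock; // injected so the sketch can be tested without sleeping

    private State state = State.CLOSED;
    private int failures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int maxFailures, long resetTimeoutMillis, LongSupplier clock) {
        this.maxFailures = maxFailures;
        this.resetTimeoutMillis = resetTimeoutMillis;
        this.clock = clock;
    }

    // An open breaker becomes half-open once the reset timeout has elapsed.
    synchronized State state() {
        if (state == State.OPEN && clock.getAsLong() - openedAt >= resetTimeoutMillis) {
            state = State.HALF_OPEN;
        }
        return state;
    }

    <T> T call(Supplier<T> protectedCall, Supplier<T> fallback) {
        if (state() == State.OPEN) {
            return fallback.get(); // fail fast, do not touch the struggling service
        }
        try {
            T result = protectedCall.get();
            onSuccess();
            return result;
        } catch (RuntimeException e) {
            onFailure();
            return fallback.get();
        }
    }

    private synchronized void onSuccess() {
        failures = 0;
        state = State.CLOSED;
    }

    private synchronized void onFailure() {
        failures++;
        // A failed probe in half-open state reopens the breaker immediately.
        if (state == State.HALF_OPEN || failures >= maxFailures) {
            state = State.OPEN;
            openedAt = clock.getAsLong();
        }
    }

    public static void main(String[] args) {
        long[] now = {0L};
        SimpleCircuitBreaker breaker = new SimpleCircuitBreaker(2, 1000L, () -> now[0]);
        Supplier<String> failing = () -> { throw new IllegalStateException("item service down"); };
        breaker.call(failing, () -> "fallback");
        breaker.call(failing, () -> "fallback");
        System.out.println(breaker.state());
        now[0] = 1000L;
        System.out.println(breaker.call(() -> "ok", () -> "fallback"));
        System.out.println(breaker.state());
    }
}
```

The point of failing fast is that callers stop queueing up behind a service that is already in trouble, which gives it room to recover.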

Degraded beats Unavailable

Degraded > Unavailable

Search

Bid

Item

// If the bidding service fails, log it and continue with an empty bid history.
CompletionStage<PSequence<Bid>> bidHistoryFuture = bidService.getBids(itemUuid)
    .invoke()
    .exceptionally(error -> {
        log.warn("Bidding service failed to load", error);
        return TreePVector.empty();
    });

https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/CompletionStage.html#exceptionally-java.util.function.Function-
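The same fallback can be tried out with nothing but the JDK. The following is a minimal sketch: `loadBids` is a hypothetical stand-in for the remote call, and plain `List<String>` replaces the `PSequence<Bid>` used in the slide.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionStage;

public class DegradedFallback {

    // Hypothetical stand-in for a remote bid lookup that may fail.
    static CompletionStage<List<String>> loadBids(boolean serviceUp) {
        CompletableFuture<List<String>> result = new CompletableFuture<>();
        if (serviceUp) {
            result.complete(Arrays.asList("bid-1", "bid-2"));
        } else {
            result.completeExceptionally(new IllegalStateException("bidding service down"));
        }
        return result;
    }

    // Degraded beats unavailable: a failed lookup becomes an empty bid history.
    static CompletionStage<List<String>> loadBidsOrEmpty(boolean serviceUp) {
        return loadBids(serviceUp)
            .exceptionally(error -> Collections.<String>emptyList());
    }

    public static void main(String[] args) {
        System.out.println(loadBidsOrEmpty(true).toCompletableFuture().join());
        System.out.println(loadBidsOrEmpty(false).toCompletableFuture().join());
    }
}
```

The item page can still render without its bid history, which is usually a better outcome than an error page.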

Bulkheading (Kind of Important)

Duplication isn’t a bad thing

Degraded > Unavailable

Search

Bid

Item

Publish/Subscribe

Topic<BidEvent> bidEvents();

default Descriptor descriptor() {
    // Besides its service calls, the bidding service publishes its events as a topic.
    return named("bidding").withCalls(
        pathCall("/api/item/:id/bids", this::placeBid),
        pathCall("/api/item/:id/bids", this::getBids)
    ).publishing(
        topic("bidding-BidEvent", this::bidEvents)
    );
}

Publish/Subscribe

Topic<BidEvent> bidEventTopic = biddingService.bidEvents();
bidEventTopic.subscribe()
    .atLeastOnce(Flow.<BidEvent>create()
        .map(this::toDocument)
        .mapAsync(1, indexedStore::store));
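At-least-once delivery means the subscriber may see the same event more than once, so the handler should tolerate duplicates. Below is a minimal sketch of an idempotent consumer in plain Java. The `BidEvent` shape with an `eventId` field and the `IdempotentBidIndexer` class are assumptions for illustration; they are not the types used in the online-auction example.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class IdempotentBidIndexer {

    // Hypothetical event shape; it assumes every event carries a unique id.
    static final class BidEvent {
        final String eventId;
        final String itemId;

        BidEvent(String eventId, String itemId) {
            this.eventId = eventId;
            this.itemId = itemId;
        }
    }

    private final Set<String> processedEventIds = new HashSet<>();
    private final Map<String, Integer> bidCountByItem = new HashMap<>();

    // Returns true if the event was applied, false if it was a redelivery.
    synchronized boolean handle(BidEvent event) {
        if (!processedEventIds.add(event.eventId)) {
            return false; // already seen: skip so the count is not inflated
        }
        bidCountByItem.merge(event.itemId, 1, Integer::sum);
        return true;
    }

    synchronized int bidCount(String itemId) {
        return bidCountByItem.getOrDefault(itemId, 0);
    }

    public static void main(String[] args) {
        IdempotentBidIndexer indexer = new IdempotentBidIndexer();
        BidEvent first = new BidEvent("evt-1", "item-42");
        indexer.handle(first);
        indexer.handle(first); // simulated redelivery
        indexer.handle(new BidEvent("evt-2", "item-42"));
        System.out.println(indexer.bidCount("item-42"));
    }
}
```

In a real service the set of processed ids would have to be stored durably together with the read-side update; keeping it in memory is only good enough for a sketch.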

Always have a plan B.

What you can do:

• Fallback pattern (cache instead of DB)
• The cost of resilience should be accuracy or latency.
• CAP theorem: of consistency, availability, and partition tolerance, you can't have all three. Network partitions will happen, so your choice is to sacrifice availability or consistency.

https://codahale.com/you-cant-sacrifice-partition-tolerance/
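The fallback pattern mentioned above (cache instead of DB) can be sketched as a wrapper that remembers the last good answer and serves it when the primary lookup fails. `CachedFallback` is a hypothetical helper, not part of Lagom, and it assumes the primary lookup never returns null.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class CachedFallback<K, V> {

    private final Function<K, V> primary; // e.g. a database lookup; assumed to return non-null
    private final Map<K, V> lastKnownGood = new ConcurrentHashMap<>();

    CachedFallback(Function<K, V> primary) {
        this.primary = primary;
    }

    // Fresh value if the primary works, otherwise the last known value, if any.
    Optional<V> get(K key) {
        try {
            V fresh = primary.apply(key);
            lastKnownGood.put(key, fresh);
            return Optional.of(fresh);
        } catch (RuntimeException e) {
            // Plan B: trade accuracy (a possibly stale value) for availability.
            return Optional.ofNullable(lastKnownGood.get(key));
        }
    }

    public static void main(String[] args) {
        boolean[] dbUp = {true};
        CachedFallback<String, String> items = new CachedFallback<>(id -> {
            if (!dbUp[0]) {
                throw new IllegalStateException("database down");
            }
            return "item-" + id;
        });
        System.out.println(items.get("1"));
        dbUp[0] = false;
        System.out.println(items.get("1")); // served from the cache
        System.out.println(items.get("2")); // never seen, so nothing to fall back to
    }
}
```

The stale answer is the price of resilience here: the caller gets a response while the database is down, but it may no longer be accurate.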

Do you remember?

8 fallacies of distributed computing

1. The network is reliable
2. Latency is zero
3. Bandwidth is infinite
4. The network is secure
5. Topology doesn't change
6. There is one administrator
7. Transport cost is zero
8. The network is homogeneous

Lessons learned.

Some things to remember.

• Distributed systems are different because they fail often.
• Writing robust distributed systems costs more than writing robust single-machine systems.
• Robust, open source distributed systems are much less common than robust, single-machine systems.
• Coordination is very hard.
• “It’s slow” is the hardest problem you’ll ever debug.
• Find ways to be partially available.

https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/

Where do we go from here?

http://www.ofbizian.com/2016/07/from-fragile-to-antifragile-software.html

Next Steps! Download and try Lagom!

Project Site: http://www.lightbend.com/lagom
GitHub Repo: https://github.com/lagom
Documentation: http://www.lagomframework.com/documentation/1.3.x/java/Home.html
Example: https://github.com/lagom/online-auction-java

Written for architects and developers that must quickly gain a fundamental understanding of microservice-based architectures, this free O’Reilly report explores the journey from SOA to microservices, discusses approaches to dismantling your monolith, and reviews the key tenets of a Reactive microservice:

• Isolate all the Things
• Act Autonomously
• Do One Thing, and Do It Well
• Own Your State, Exclusively
• Embrace Asynchronous Message-Passing
• Stay Mobile, but Addressable
• Collaborate as Systems to Solve Problems

http://bit.ly/ReactiveMicroservice

The detailed example in this report is based on Lagom, a new framework that helps you follow the requirements for building distributed, reactive systems.

• Get an overview of the Reactive Programming model and basic requirements for developing reactive microservices

• Learn how to create base services, expose endpoints, and then connect them with a simple, web-based user interface

• Understand how to deal with persistence, state, and clients

• Use integration technologies to start a successful migration away from legacy systems

http://bit.ly/DevelopReactiveMicroservice
