Resilience, Service-Discovery and Z-D Deployment
Bodo Junglas, York Xylander
Who is leanovate?
25 people, HQ in Kreuzberg
Mission: Build learning organizations
Dev
Java, Scala, SPA, ..
Meth
Kanban, Scrum, Lean*,..
What do we do?
Consulting
Coaching
Training
Doing
Promises of µServices architectures
Independent (parallel) development
Independent release cycles
Independent scaling
Challenges of µServices architectures
• Developing & Running
• Configuring
• Debugging
• Deploying
• Discovering
• Resilience
• …
Tough to learn & understand!
Microzon: A lab for µServices
https://github.com/leanovate/microzon
Case Study: Microzon-Shop
• Browse through catalog
• Select product
• Put into cart
• Checkout
DEMO
Case Study: Microzon-Shop
Web Facade
Customer Product Cart Billing
Load-Balancer / Firewall
mysql mongo mysql mysql
Technology-Babylon
Customer Product Cart Billing
mysql mongo mysql mysql
Web Facade: Play 2.3, Scala 2.11
Spring-Boot 1.2.1, JPA/Hibernate
Spray 1.3 / Akka 2.3, Scala 2.10, ReactiveMongo
Scalatra, Jetty, …
finatra 1.6, async-mysql, finagle 6.20
Dropwizard 0.7, JDBI
Challenges:
• Running:
• How long does it take to get a dev system up and running for a new team member?
• How do you run your CI system in your favorite cloud?
• Configuring
• Debugging
Case Study: Microzon-Shop
• Start system with docker
• Create products
• Logstash
• Zipkin
DEMO
Why is that a challenge?
Web Facade
Customer Product Cart Billing
mysql mongo mysql mysql
Why is that a challenge?
Web Facade
Customer Product Cart Billing
mysql mongo mysql mysql
Support UI Marketing tool Reporting
Payment adapter
Why is that a challenge?
Web Facade
Customer Product Cart Billing
mysql mongo mysql mysql
Support UI Marketing tool Reporting
Payment adapter?
Running/Configuring/Debugging
Takeaway
• … will quickly become a non-trivial matter
• We have chosen …
  • docker for the development system
  • puppet for »production«
  • elasticsearch/logstash/kibana for distributed logging
  • zipkin for request tracing
• … but that is not the focus of this talk
Challenge: Deploying
Zero-downtime deployment strategies/variants:
Blue/Green, Wave, Canary,…
Diagram: Delivery Pipeline, Deploy, Load balancer/Router, Service Version n (two instances), Service Version n+1
Challenge: Service discovery
• How to remove service nodes from the cluster or take them temporarily offline?
• How to add new ones or take them online again?
How do services find each other? => Service Discovery
Service Discovery by configuration
You (mis)use your configuration management tool (puppet/chef) to generate service configuration with explicit endpoints
Diagram: Config Management (add/remove node entry) updates the generated configuration on Node1, Node2 and Node3
Service Discovery by configuration
Pros:
• Simply works
• No extra technology involved (and thereby no extra point of failure)
Cons:
• To take a service node offline one has to update all of its consumers (or wait for them to be updated)
• All consumers have to be able to reload their configuration without restart
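Reloading without a restart is easier if a consumer keeps its endpoint list behind a reference that can be swapped atomically. The following is only a minimal sketch under assumptions: `EndpointRegistry` is a hypothetical class, and whatever detects the regenerated config file (a file watcher, a signal handler) is not shown.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical helper: holds the endpoints generated by the config management tool.
final class EndpointRegistry {
    private final AtomicReference<List<String>> endpoints;

    EndpointRegistry(List<String> initial) {
        this.endpoints = new AtomicReference<>(List.copyOf(initial));
    }

    // Called after the configuration file was regenerated; no restart needed.
    void reload(List<String> updated) {
        endpoints.set(List.copyOf(updated));
    }

    // Returns an immutable snapshot; callers that already hold one are not affected by reload().
    List<String> current() {
        return endpoints.get();
    }
}
```

A request handler would call `current()` once per request, so a reload never changes the list in the middle of a request.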
Service Discovery by DNS
DNS is actually a service discovery
Diagram: CustomerService asks DNS for product.loc and receives {10.X.X.1, 10.X.X.2, 10.X.X.3}, the addresses of ProductService1, ProductService2 and ProductService3
DNS
Pros:
• Old technology that is proven to work on a very, very large scale
• Supported by almost everyone
Cons:
• Rather crude (and very inconvenient) interface (especially for updates)
• Resolved service names might be cached on multiple levels
• Focuses purely on service nodes, not on the services themselves
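On the JVM, a DNS-based lookup can be as small as the sketch below. `DnsLookup` is a hypothetical helper; a name like `product.loc` would be passed in as `serviceName`. The caching con applies here as well: the JVM keeps its own lookup cache (governed by the `networkaddress.cache.ttl` security property), on top of whatever the operating system and resolvers cache.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: asks the platform resolver for all addresses behind a service name.
final class DnsLookup {
    // Returns every address the resolver reports for the given name.
    static List<String> resolveAll(String serviceName) throws UnknownHostException {
        List<String> result = new ArrayList<>();
        for (InetAddress address : InetAddress.getAllByName(serviceName)) {
            result.add(address.getHostAddress());
        }
        return result;
    }
}
```

The result only says which nodes exist according to DNS, not whether the service on a node is healthy.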
Also: DNS might lead to wrong assumptions
public void connect( ... ) throws IOException {
    ...
    // DNS may return several addresses for one host name
    final InetAddress[] addresses = this.dnsResolver.resolve(host.getHostName());
    ...
    // the resolved addresses are tried one after another
    for (int i = 0; i < addresses.length; i++) {
        final InetAddress address = addresses[i];
        final boolean last = i == addresses.length - 1;
        Socket sock = sf.createSocket(context);
        ...
        try {
            sock.setSoTimeout(socketConfig.getSoTimeout());
            ...
            conn.bind(sock);
            return;
        } catch (final SocketTimeoutException ex) {
            ...
        } catch (final ConnectException ex) {
            ...
        }
    }
}
Apache HttpClient 4.3: org.apache.http.impl.conn.HttpClientConnectionOperator
Service discovery service
Zookeeper, Curator, consul, etcd, doozerd, SkyDNS, Eureka, …
DiscoveryService
Customer
Product Cart
Billing
Evaluation criteria
- High availability
- Consistency (consistency over availability?)
- Support for service-level checks
- API (http, DNS?, …)
- Footprint (Memory/CPU)
- Multiple datacenter support
- UI
- Template engine (for config files)
Consul
Diagram: Consul server cluster of five consul servers (Raft consensus). Each service node (A) runs a consul agent (http / dns, health check) next to the service; consul template (optional) writes config files and triggers restart/reload.
Consul
• Show consul UI
• Show web-service status page
• Add cart/remove cart
DEMO
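Outside of the demo, the http interface of the local consul agent can be queried for the healthy nodes of a service. The sketch below assumes an agent on Consul's default HTTP port 8500 and a service registered under the name `cart`; `ConsulLookup` is a hypothetical helper and JSON parsing is left out. It uses Consul's `/v1/health/service/<name>` endpoint with the `passing` filter.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical helper around the health endpoint of a consul agent.
final class ConsulLookup {
    // Builds the health query; "?passing" filters out nodes with failing checks.
    static URI healthUri(String agentBaseUrl, String serviceName) {
        return URI.create(agentBaseUrl + "/v1/health/service/" + serviceName + "?passing");
    }

    // Returns the raw JSON answer of the agent; parsing is intentionally omitted.
    static String passingNodesJson(String agentBaseUrl, String serviceName)
            throws IOException, InterruptedException {
        HttpRequest request = HttpRequest.newBuilder(healthUri(agentBaseUrl, serviceName)).GET().build();
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        return response.body();
    }
}
```

A call such as `ConsulLookup.passingNodesJson("http://localhost:8500", "cart")` needs a running agent; without one it fails with an `IOException`.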
Challenge: Resilience
Every system has a retry!
Do not do failover/retry on connection
GET: ok
DELETE: ok
PUT: ok
POST: ???
• Duplicates might be ok (e.g. create new shopping cart) …
• … or not (e.g. register new customer)
• Might be solved by a request token (e.g. the XSRF token from the web), as long as the service supports this
GET really ok? What about streaming?
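The request-token idea for POST can be sketched on the service side as follows. This is an illustration under assumptions, not how any of the Microzon services do it: `IdempotentRequests` is a hypothetical class, results are kept in memory only, and the action must return a non-null result. A real service would need shared storage and token expiry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Hypothetical server-side helper: executes a request at most once per token.
final class IdempotentRequests<R> {
    private final Map<String, R> resultsByToken = new ConcurrentHashMap<>();

    // A retried POST with the same token gets the stored result instead of creating a duplicate.
    R executeOnce(String requestToken, Supplier<R> action) {
        return resultsByToken.computeIfAbsent(requestToken, token -> action.get());
    }
}
```

The client generates the token once per logical request and sends the same token again on every retry.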
Failover should be part of the business
• Usually the failover strategy depends on the concrete use case
• Handling failover on the protocol layer (http-client) might hide error scenarios from the programmer
• It can be difficult to distinguish between technical and business errors on the protocol layer
As a rule of thumb: You want to retry all technical errors, but not the business errors … though even that is debatable in some cases
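The rule of thumb can be sketched in a few lines once the two error kinds are distinguishable by type. Everything here is an assumption for illustration: `BusinessException` is a hypothetical marker for errors a retry cannot fix, and `IOException` stands in for "technical error".

```java
import java.io.IOException;
import java.util.concurrent.Callable;

// Hypothetical marker for errors a retry cannot fix (e.g. "customer already exists").
final class BusinessException extends RuntimeException {
    BusinessException(String message) {
        super(message);
    }
}

final class Retry {
    // Retries technical errors (IOException) up to maxAttempts; everything else propagates at once.
    static <R> R withRetry(int maxAttempts, Callable<R> call) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return call.call();
            } catch (IOException e) {
                last = e; // technical error: try again
            }
        }
        throw last; // all attempts failed with a technical error
    }
}
```

Whether a given call may be wrapped like this at all is still a use-case decision, for the reasons listed above.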
How »not« to do service failover
import java.io.IOException;
import java.util.List;
import java.util.Random;

public class ServiceFailover {
    private static final Random RANDOM = new Random(System.currentTimeMillis());

    // Starts at a random endpoint and blindly tries every endpoint until one succeeds.
    public static <E, R> R retry(final List<E> endpoints, final Requester<E, R> requester) throws IOException {
        final int size = endpoints.size();
        if (size == 0) throw new RuntimeException("No active endpoints found");
        final int offset = RANDOM.nextInt(size);
        IOException lastException = null;
        for (int idx = 0; idx < size; idx++) {
            final E endpoint = endpoints.get((idx + offset) % size);
            try {
                return requester.performTry(endpoint);
            } catch (IOException e) {
                // Any IOException triggers a retry, whether or not the request is safe to repeat.
                lastException = e;
            }
        }
        throw lastException;
    }

    @FunctionalInterface
    public interface Requester<E, R> {
        R performTry(E endpoint) throws IOException;
    }
}
Resilience
• Kill 2 consul nodes
• Kill one cart node
DEMO
Hystrix
• Circuit-Breaking
• Fail-Early
• Developers are »forced« to think in commands with a potential fallback result rather than in REST calls
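To make the three points concrete, here is a hand-rolled sketch of a circuit breaker with fallback. It is explicitly not the Hystrix API; `SimpleCircuitBreaker`, its thresholds and the use of `Supplier` are assumptions for illustration. For brevity the whole call runs inside a synchronized method, which serializes commands and would be too coarse for production use.

```java
import java.util.function.Supplier;

// Hand-rolled illustration of circuit-breaking; this is NOT the Hystrix API.
final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private int consecutiveFailures = 0;
    private long openedAt = 0L;

    SimpleCircuitBreaker(int failureThreshold, long openMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
    }

    // Runs the command, or returns the fallback immediately while the circuit is open.
    synchronized <R> R execute(Supplier<R> command, Supplier<R> fallback) {
        boolean open = consecutiveFailures >= failureThreshold
                && System.currentTimeMillis() - openedAt < openMillis;
        if (open) {
            return fallback.get(); // fail early, do not touch the remote service
        }
        try {
            R result = command.get();
            consecutiveFailures = 0; // success closes the circuit
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAt = System.currentTimeMillis(); // (re)open the circuit
            }
            return fallback.get();
        }
    }
}
```

Once `openMillis` has passed, the next command is let through as a trial; a failure reopens the circuit, a success closes it.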
How it should actually look
Observe, Service Timeout, Request Draining
Load Balancer
Monitor, Trace, Observe, Failure Accrual, Request Timeout, Pool, Fail Fast
Expiration, Dispatcher
according to finagle
What other people do
Netflix
Many libraries and tools that build on top of each other
• RxJava/ReactiveX: Reactive extension/Reactive streaming
• Based on netty: RxNetty
• Hystrix: Basic command system for circuit-breaking/fail-early
• Eureka: Service registry
• Ribbon: REST client with failover/service discovery based on Hystrix/Eureka
• …
What other people do
Twitter
Services based on finagle (scala)
• … which is itself based on netty
• … which also contains the basics for failover/retry/service-discovery/monitoring
Service-Discovery done via zookeeper, but can be adapted/extended to other tools
Several frameworks/connectors build on top: finatra, async-mysql-connector …
Service discovery
Takeaway
• Helps a lot to realize …
  • … any kind of zero-downtime deployment strategy
  • … a self-healing micro-service jungle
• Does not create a fully resilient system by itself, even though it is the basis of it
• Might conflict with your existing configuration system (when creating config files via templates)
• Might be just another central component that fails
Failover/Retry
Takeaway
• The failover strategy usually depends on the business case
• A full failover stack is quite a piece of work
• Emerging frameworks might make life easier or at least provide a reference implementation