voxxed vienna 2015 fault tolerant microservices

Post on 19-Jul-2015


@chbatey #Voxxed

Fault tolerant microservices
Christopher Batey
DataStax

Who am I?
• DataStax: Technical Evangelist / Software Engineer
- Builds enterprise ready version of Apache Cassandra
• Sky: Building next generation Internet TV platform
• Lots of time working on a test double for Apache Cassandra

Agenda
• Setting the scene
• What do we mean by a fault?
• What is a micro(ish)service?
• Monolith application vs the micro(ish)service
• A worked example
• Identify an issue
• Reproduce/test it
• Show how to deal with the issue

So… what do applications look like?

So... what do systems look like now?

But different things go wrong...
• Down
• Slow network
• Slow app
• Missing packets
• GC :(
SLA: 2 second max

Example: Movie player service
[Diagram: a Play Movie request reaches the Movie Player, which depends on UserService, DeviceService and PinService]

Time for an example...
• All examples are on GitHub
• Technologies used:
• Dropwizard
• Spring Boot
• Wiremock
• Hystrix
• Graphite
• Saboteur

Testing microservices
• You don’t know a service is fault tolerant if you don’t test faults

Isolated service tests
[Diagram: the Play Movie AcceptanceTest primes mocks of the User, Device and Pin services, then exercises the Movie service over real HTTP/TCP]

Fault tolerance
1. Don’t take forever - Timeouts
2. Don’t try if you can’t succeed
3. Fail gracefully
4. Know if it’s your fault
5. Don’t whack a dead horse
6. Turn broken stuff off

1 - Don’t take forever
• If at first you don’t succeed, don’t take forever to tell someone
• Timeout and fail fast

Which timeouts?
• Socket connection timeout
• Socket read timeout

Customer: Your service hung for 30 seconds :(
You: :(

Which timeouts?
• Socket connection timeout
• Socket read timeout
• Resource acquisition

Your service hung for 10 minutes :(

Let’s think about this

A little more detail
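The two socket-level timeouts can be sketched with nothing but `java.net`. This is an illustration under stated assumptions, not code from the talk's repositories: `SocketTimeouts`, `probe` and `probeSilentServer` are hypothetical names, and the "server" is just a local listening socket that completes the TCP handshake but never sends a byte.

```java
import java.io.IOException;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical helper for illustration; not taken from the talk's repositories.
final class SocketTimeouts {

    /** Connects and reads one byte, returning a short outcome string instead of hanging. */
    static String probe(InetSocketAddress address, int connectTimeoutMs, int readTimeoutMs) {
        try (Socket socket = new Socket()) {
            // Socket connection timeout: bounds the TCP handshake.
            socket.connect(address, connectTimeoutMs);
            // Socket read timeout: bounds each blocking read, not the whole response.
            socket.setSoTimeout(readTimeoutMs);
            int firstByte = socket.getInputStream().read();
            return firstByte == -1 ? "closed" : "data";
        } catch (SocketTimeoutException e) {
            return "timeout";
        } catch (IOException e) {
            return "io error: " + e.getMessage();
        }
    }

    /** Probes a local listening socket that completes the handshake but never sends a byte. */
    static String probeSilentServer(int readTimeoutMs) {
        try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress())) {
            InetSocketAddress address =
                    new InetSocketAddress(server.getInetAddress(), server.getLocalPort());
            return probe(address, 1000, readTimeoutMs);
        } catch (IOException e) {
            return "io error: " + e.getMessage();
        }
    }
}
```

Because the read timeout restarts on every read, a dependency that trickles out one byte at a time can keep a caller waiting far longer than the configured value. That is one way a service with "sensible" socket timeouts can still hang for minutes.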

Wiremock + Saboteur + Vagrant
• Vagrant - launches + provisions local VMs
• Saboteur - uses tc, iptables to simulate network issues
• Wiremock - used to mock HTTP dependencies
• Cucumber - acceptance tests

I can write an automated test for that?
[Diagram: Wiremock (standing in for the User Service, Device Service and Pin Service) and Saboteur run in a Vagrant + VirtualBox VM. The AcceptanceTest primes Saboteur to drop traffic, exercises the MovieService, then resets Saboteur.]


Implementing reliable timeouts
• Protect the container thread!
• Homemade: Worker Queue + Thread pool (executor)
• Hystrix
• Spring Cloud Netflix
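The "homemade" option might look roughly like the sketch below: the container thread hands the slow call to a bounded worker pool and waits for at most a fixed time. The class name `TimeLimitedCall`, the pool size of 4, the queue size of 10 and the fallback parameter are all assumptions chosen for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadFactory;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical helper for illustration; sizes are arbitrary assumptions.
final class TimeLimitedCall {
    private static final ExecutorService WORKERS = newWorkerPool();

    private static ExecutorService newWorkerPool() {
        BlockingQueue<Runnable> queue = new ArrayBlockingQueue<Runnable>(10); // bounded queue
        ThreadFactory daemonThreads = runnable -> {
            Thread thread = new Thread(runnable, "dependency-worker");
            thread.setDaemon(true); // do not keep the JVM alive for stuck calls
            return thread;
        };
        return new ThreadPoolExecutor(4, 4, 0L, TimeUnit.MILLISECONDS, queue, daemonThreads);
    }

    /** Runs the call on a worker thread; the caller waits at most timeoutMs. */
    static <T> T callWithTimeout(Callable<T> call, long timeoutMs, T fallback) {
        Future<T> future;
        try {
            future = WORKERS.submit(call);
        } catch (RejectedExecutionException e) {
            return fallback; // all workers busy and the queue is full: fail fast
        }
        try {
            return future.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker; the caller is already free
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the call itself failed
        }
    }
}
```

Note that `future.cancel(true)` only interrupts the worker. A call that ignores interruption still occupies its worker thread, which is why the pool and queue are bounded.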

A simple Spring RestController

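import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class Resource {
    private static final Logger LOGGER = LoggerFactory.getLogger(Resource.class);

    @Autowired
    private ScaryDependency scaryDependency;

    @RequestMapping("/scary")
    public String callTheScaryDependency() {
        LOGGER.info("Resource later: I wonder which thread I am on!");
        return scaryDependency.getScaryString();
    }
}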

Scary dependency

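import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class ScaryDependency {
    private static final Logger LOGGER = LoggerFactory.getLogger(ScaryDependency.class);

    public String getScaryString() {
        LOGGER.info("Scary Dependency: I wonder which thread I am on! Tomcats?");
        if (System.currentTimeMillis() % 2 == 0) {
            return "Scary String";
        } else {
            // Simulate a slow dependency roughly half of the time.
            try {
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "Slow Scary String";
        }
    }
}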

All on the Tomcat thread
13:47:20.200 [http-8080-exec-1] INFO info.batey.examples.Resource - Resource later: I wonder which thread I am on!
13:47:20.200 [http-8080-exec-1] INFO info.batey.examples.ScaryDependency - Scary Dependency: I wonder which thread I am on! Tomcats?

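Scary dependency
import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import org.springframework.stereotype.Component;

@Component
public class ScaryDependency {
    private static final Logger LOGGER = LoggerFactory.getLogger(ScaryDependency.class);

    // The only change: Hystrix now runs this method on its own thread pool.
    @HystrixCommand()
    public String getScaryString() {
        LOGGER.info("Scary Dependency: I wonder which thread I am on! Tomcats?");
        if (System.currentTimeMillis() % 2 == 0) {
            return "Scary String";
        } else {
            // Simulate a slow dependency roughly half of the time.
            try {
                Thread.sleep(5000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
            return "Slow Scary String";
        }
    }
}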

What an annotation can do...
13:51:21.513 [http-8080-exec-1] INFO info.batey.examples.Resource - Resource later: I wonder which thread I am on!
13:51:21.614 [hystrix-ScaryDependency-1] INFO info.batey.examples.ScaryDependency - Scary Dependency: I wonder which thread I am on! Tomcats? :P

Timeouts take home
● You can’t use network level timeouts for SLAs
● Test your SLAs - if someone says you can’t, hit them with a stick
● Scary things happen without network issues


2 - Don’t try if you can’t succeed

Complexity

“When an application grows in complexity it will eventually start sending emails”

Complexity

“When an application grows in complexity it will eventually start using queues and thread pools”

Or use Akka :)


Don’t try if you can’t succeed
• Executor unbounded queues :(
• newFixedThreadPool
• newSingleThreadExecutor
• newCachedThreadPool
• Bound your queues and threads
• Fail quickly when the queue / maxPoolSize limit is reached
• Know your drivers


This is a functional requirement
• Set the timeout very high
• Use Wiremock to add a large delay to the requests
• Set queue size and thread pool size to 1
• Send in 2 requests to use the thread and fill the queue
• What happens on the 3rd request?
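The scenario above can be reproduced in miniature with a plain `ThreadPoolExecutor`. In this sketch a `CountDownLatch` stands in for the Wiremock delay, and `BoundedPoolDemo` and `countRejected` are hypothetical names; it assumes the pool uses the JDK's `AbortPolicy`, which rejects work when both the thread and the queue are occupied.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; the latch plays the role of a very slow dependency.
final class BoundedPoolDemo {

    /** Submits slow tasks to a 1-thread, 1-slot pool and returns how many were rejected. */
    static int countRejected(int requests) {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<Runnable>(1),       // queue size 1
                new ThreadPoolExecutor.AbortPolicy());     // reject instead of queueing forever
        CountDownLatch release = new CountDownLatch(1);
        int rejected = 0;
        try {
            for (int i = 0; i < requests; i++) {
                try {
                    pool.execute(() -> {
                        try {
                            release.await(); // block until the demo is over
                        } catch (InterruptedException e) {
                            Thread.currentThread().interrupt();
                        }
                    });
                } catch (RejectedExecutionException e) {
                    rejected++; // fail quickly: no thread and no queue slot available
                }
            }
        } finally {
            release.countDown(); // let the blocked tasks finish
            pool.shutdown();
        }
        return rejected;
    }
}
```

With this configuration the first request takes the thread, the second fills the queue, and the third is rejected immediately rather than waiting behind the other two.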


3 - Fail gracefully

Expect rubbish
• Expect invalid HTTP
• Expect malformed response bodies
• Expect connection failures
• Expect huge / tiny responses
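One way to act on this list is to validate a dependency's response before trusting it. The sketch below assumes a pin check endpoint whose body is the literal text `true` or `false`; the class name `PinCheckResponseParser` and the 16 byte limit are made up for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;

// Hypothetical parser; the size limit and accepted bodies are assumptions.
final class PinCheckResponseParser {
    private static final int MAX_BODY_BYTES = 16;

    /** Returns the parsed answer, or empty if the response is anything unexpected. */
    static Optional<Boolean> parse(int statusCode, byte[] body) {
        if (statusCode != 200) {
            return Optional.empty();            // error or unexpected status
        }
        if (body == null || body.length == 0 || body.length > MAX_BODY_BYTES) {
            return Optional.empty();            // tiny or huge response
        }
        String text = new String(body, StandardCharsets.UTF_8).trim();
        if (text.equals("true")) {
            return Optional.of(true);
        }
        if (text.equals("false")) {
            return Optional.of(false);
        }
        return Optional.empty();                // malformed body
    }
}
```

This is deliberately stricter than `Boolean.valueOf`, which quietly returns `false` for any string other than "true", so random data from a broken dependency would look like a legitimate "no".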

Testing with Wiremock
stubFor(get(urlEqualTo("/dependencyPath"))
    .willReturn(aResponse()
        .withFault(Fault.MALFORMED_RESPONSE_CHUNK)));

{ "request": { "method": "GET", "url": "/fault" }, "response": { "fault": "RANDOM_DATA_THEN_CLOSE" } }

{ "request": { "method": "GET", "url": "/fault" }, "response": { "fault": "EMPTY_RESPONSE" } }

Stubbed Cassandra


4 - Know if it’s your fault

Record stuff
• Metrics:
- Timings
- Errors
- Concurrent incoming requests
- Thread pool statistics
- Connection pool statistics
• Logging: Boundary logging, ElasticSearch / Logstash
• Request identifiers

Graphite + Codahale

Response times

Separate resource pools
• Don’t flood your dependencies
• Be able to answer the questions:
- How many connections will you make to dependency X?
- Are you getting close to your max connections?
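A per-dependency limit does not need a framework. As a rough sketch, one `Semaphore` per dependency caps concurrent calls and makes both questions answerable at runtime. `DependencyLimiter` and its methods are hypothetical names, and the fallback-on-saturation behaviour is an assumption about what the caller wants.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

// Hypothetical limiter: create one instance per dependency so they cannot starve each other.
final class DependencyLimiter {
    private final int maxConcurrent;
    private final Semaphore permits;

    DependencyLimiter(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the call if a permit is free; otherwise returns the fallback without waiting. */
    <T> T call(Supplier<T> call, T fallback) {
        if (!permits.tryAcquire()) {
            return fallback; // at the limit: do not flood the dependency
        }
        try {
            return call.get();
        } finally {
            permits.release();
        }
    }

    /** How many calls to this dependency are in flight right now. */
    int inUse() {
        return maxConcurrent - permits.availablePermits();
    }

    /** The most calls this service will ever make to the dependency at once. */
    int max() {
        return maxConcurrent;
    }
}
```

Publishing `inUse()` next to `max()` as a gauge is one way to see how close a service is to its connection limit before it gets there.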

So easy with Dropwizard + Hystrix

metrics:
  reporters:
    - type: graphite
      host: 192.168.10.120
      port: 2003
      prefix: shiny_app

@Override
public void initialize(Bootstrap<AppConfig> appConfigBootstrap) {
    HystrixCodaHaleMetricsPublisher metricsPublisher =
            new HystrixCodaHaleMetricsPublisher(appConfigBootstrap.getMetricRegistry());
    HystrixPlugins.getInstance().registerMetricsPublisher(metricsPublisher);
}



5 - Don’t whack a dead horse

What to do…
• Yes this will happen…
• Mandatory dependency - fail *really* fast
• Throttling
• Fallbacks

Circuit breaker pattern

Implementation with Hystrix

@Path("integrate")
public class IntegrationResource {
    private static final Logger LOGGER = LoggerFactory.getLogger(IntegrationResource.class);

    // The userService, deviceService and pinService fields are not shown on the slide.

    @GET
    @Timed
    public String integrate() {
        LOGGER.info("integrate");
        // Each dependency call is wrapped in its own Hystrix command.
        String user = new UserServiceDependency(userService).execute();
        String device = new DeviceServiceDependency(deviceService).execute();
        Boolean pinCheck = new PinCheckDependency(pinService).execute();
        return String.format("[User info: %s] \n[Device info: %s] \n[Pin check: %s] \n", user, device, pinCheck);
    }
}

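Implementation with Hystrix
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.util.EntityUtils;

public class PinCheckDependency extends HystrixCommand<Boolean> {
    private HttpClient httpClient;

    public PinCheckDependency(HttpClient httpClient) {
        super(HystrixCommandGroupKey.Factory.asKey("PinCheckService"));
        this.httpClient = httpClient;
    }

    // Runs on a Hystrix thread; any exception counts as a failure.
    @Override
    protected Boolean run() throws Exception {
        HttpGet pinCheck = new HttpGet("http://localhost:9090/pincheck");
        HttpResponse pinCheckResponse = httpClient.execute(pinCheck);
        int statusCode = pinCheckResponse.getStatusLine().getStatusCode();
        if (statusCode != 200) {
            throw new RuntimeException("Oh dear no pin check, status code " + statusCode);
        }
        String pinCheckInfo = EntityUtils.toString(pinCheckResponse.getEntity());
        return Boolean.valueOf(pinCheckInfo);
    }

    // Used when run() fails, times out, is rejected, or the circuit is open.
    @Override
    public Boolean getFallback() {
        return true;
    }
}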

Triggering the fallback
• Error threshold percentage
• Bucket of time for the percentage
• Minimum number of requests to trigger
• Time before trying a request again
• Disable
• Per instance statistics
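To make the first four knobs concrete, here is a deliberately simplified circuit breaker. It is not how Hystrix is implemented: it uses one fixed statistics window instead of rolling buckets and lets every request through once the sleep window has passed. All names (`SimpleCircuitBreaker`, `call`, `isOpen`) are hypothetical, and the clock is injected so the behaviour can be tested without sleeping.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

// Simplified sketch of the pattern; not Hystrix's implementation.
final class SimpleCircuitBreaker {
    private final int errorThresholdPercent; // error threshold percentage
    private final int minimumRequests;       // minimum number of requests to trigger
    private final long windowMs;             // bucket of time for the percentage
    private final long sleepWindowMs;        // time before trying a request again
    private final LongSupplier clock;        // millisecond clock, injectable for tests

    private long windowStart;
    private int requests;
    private int failures;
    private boolean open;
    private long openedAt;

    SimpleCircuitBreaker(int errorThresholdPercent, int minimumRequests,
                         long windowMs, long sleepWindowMs, LongSupplier clock) {
        this.errorThresholdPercent = errorThresholdPercent;
        this.minimumRequests = minimumRequests;
        this.windowMs = windowMs;
        this.sleepWindowMs = sleepWindowMs;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    /** Runs the command unless the circuit is open; failures and open circuits use the fallback. */
    <T> T call(Supplier<T> command, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // open: do not touch the dependency at all
        }
        try {
            T result = command.get();
            record(true);
            return result;
        } catch (RuntimeException e) {
            record(false);
            return fallback.get();
        }
    }

    synchronized boolean isOpen() {
        return open;
    }

    private synchronized boolean allowRequest() {
        return !open || clock.getAsLong() - openedAt >= sleepWindowMs;
    }

    private synchronized void record(boolean success) {
        long now = clock.getAsLong();
        if (now - windowStart >= windowMs) { // start a fresh statistics window
            windowStart = now;
            requests = 0;
            failures = 0;
        }
        requests++;
        if (!success) {
            failures++;
        }
        if (open) {
            if (success) {                   // trial request worked: close the circuit
                open = false;
                windowStart = now;
                requests = 0;
                failures = 0;
            } else {                         // trial request failed: stay open, wait again
                openedAt = now;
            }
        } else if (requests >= minimumRequests
                && failures * 100 >= errorThresholdPercent * requests) {
            open = true;
            openedAt = now;
        }
    }
}
```

The minimum request count matters more than it looks: without it, a single failed request on a quiet service would be a 100% error rate and trip the breaker.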


6 - Turn off broken stuff
• The kill switch
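The slide does not say how the kill switch is implemented. As one possible shape, a flag that operators can flip at runtime is enough to take a broken feature out of the request path; `KillSwitch` and its methods are hypothetical, and how the flag gets flipped (admin endpoint, config push) is left open.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

// Hypothetical runtime switch; wiring it to an admin endpoint or config is not shown.
final class KillSwitch {
    private final AtomicBoolean enabled = new AtomicBoolean(true);

    void turnOff() {
        enabled.set(false);
    }

    void turnOn() {
        enabled.set(true);
    }

    /** Runs the feature only while it is switched on; otherwise returns the given default. */
    <T> T call(Supplier<T> feature, T whenOff) {
        return enabled.get() ? feature.get() : whenOff;
    }
}
```

Hystrix users can get a similar effect without new code by forcing a command's circuit open through the `circuitBreaker.forceOpen` property.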

To recap
1. Don’t take forever - Timeouts
2. Don’t try if you can’t succeed
3. Fail gracefully
4. Know if it’s your fault
5. Don’t whack a dead horse
6. Turn broken stuff off

Links
• Examples:
- https://github.com/chbatey/spring-cloud-example
- https://github.com/chbatey/dropwizard-hystrix
- https://github.com/chbatey/vagrant-wiremock-saboteur
• Tech:
- https://github.com/Netflix/Hystrix
- https://www.vagrantup.com/
- http://wiremock.org/
- https://github.com/tomakehurst/saboteur

Questions?

Thanks for listening!
Questions: @chbatey
http://christopher-batey.blogspot.co.uk/

Developer takeaways
● Learn about TCP
● Love vagrant, docker etc to enable testing
● Don’t trust libraries

Hystrix cost - do this yourself

Hystrix metrics
● Failure count
● Percentiles from Hystrix point of view
● Error percentages

How to test metric publishing?
● Stub out graphite and verify calls?
● Programmatically call graphite and verify numbers?
● Make metrics + logs part of the story demo
