Voxxed Vienna 2015: Fault tolerant microservices
@chbatey #Voxxed
Fault tolerant microservices
Christopher Batey
DataStax
@chbatey
Who am I?
• DataStax
  - Technical Evangelist / Software Engineer
  - Builds enterprise ready version of Apache Cassandra
• Sky: Building next generation Internet TV platform
• Lots of time working on a test double for Apache Cassandra
Agenda
• Setting the scene
• What do we mean by a fault?
• What is a micro(ish)service?
• Monolith application vs the micro(ish)service
• A worked example
• Identify an issue
• Reproduce/test it
• Show how to deal with the issue
So… what do applications look like?
So... what do systems look like now?
But different things go wrong...
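[Diagram: things that go wrong between services: down, slow network, slow app, missing packets, GC :(. SLA: 2 second max]
Example: Movie player service
[Diagram: a Play Movie request hits the Movie Player, which depends on the UserService, DeviceService and PinService]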
Time for an example...
• All examples are on GitHub
• Technologies used:
  - Dropwizard
  - Spring Boot
  - Wiremock
  - Hystrix
  - Graphite
  - Saboteur
Testing microservices
• You don’t know a service is fault tolerant if you don’t test faults
Isolated service tests
[Diagram: the acceptance test primes mocks of the User, Device and Pin services, then sends Play Movie to the Movie service over real HTTP/TCP]
Fault tolerance
1. Don’t take forever - Timeouts
2. Don’t try if you can’t succeed
3. Fail gracefully
4. Know if it’s your fault
5. Don’t whack a dead horse
6. Turn broken stuff off
1 - Don’t take forever
• If at first you don’t succeed, don’t take forever to tell someone
• Timeout and fail fast
Which timeouts?
• Socket connection timeout
• Socket read timeout
Your service hung for 30 seconds :(
[Picture: Customer, You :(]
Which timeouts?
• Socket connection timeout
• Socket read timeout
• Resource acquisition
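The two socket timeouts above can be sketched with nothing but the JDK. This is a minimal illustration, not code from the talk: the class name and the timeout values are hypothetical. Note that the read timeout only bounds the wait for the next chunk of data, so a dependency that trickles bytes can still take far longer than either value, and time spent waiting for a pooled resource is not covered at all.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

final class TimeoutSettings {
    // Hypothetical values for illustration; derive real ones from your SLA.
    static final int CONNECT_TIMEOUT_MS = 500;
    static final int READ_TIMEOUT_MS = 1_000;

    /** Prepares (but does not yet connect) an HTTP connection with both socket timeouts set. */
    static HttpURLConnection open(String url) {
        try {
            HttpURLConnection connection =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            connection.setConnectTimeout(CONNECT_TIMEOUT_MS); // max time to establish the TCP connection
            connection.setReadTimeout(READ_TIMEOUT_MS);       // max wait for data on each read, not for the whole response
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Neither setting limits the total time a request may take, which is why the later slides move the call onto a separate thread.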
Your service hung for 10 minutes :(
Let’s think about this
A little more detail
Wiremock + Saboteur + Vagrant
• Vagrant - launches + provisions local VMs
• Saboteur - uses tc, iptables to simulate network issues
• Wiremock - used to mock HTTP dependencies
• Cucumber - acceptance tests
I can write an automated test for that?
[Diagram: the AcceptanceTest drives the MovieService; Wiremock (User Service, Device Service, Pin Service) and Saboteur run inside a Vagrant + VirtualBox VM; the test primes Saboteur to drop traffic, then resets it]
Implementing reliable timeouts
• Protect the container thread!
• Homemade: Worker Queue + Thread pool (executor)
• Hystrix
• Spring Cloud Netflix
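The "homemade" option could look roughly like the sketch below: the dependency call runs on a worker thread, and the container thread waits for a bounded time only. The class and method names are hypothetical and this is not the code from the talk's repositories. It assumes the supplied executor is itself bounded.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

final class TimeLimitedCaller {
    private final ExecutorService workers;

    TimeLimitedCaller(ExecutorService workers) {
        this.workers = workers;
    }

    /** Runs the call on a worker thread; the calling thread waits at most timeoutMillis. */
    <T> T call(Supplier<T> dependencyCall, long timeoutMillis, T fallback) {
        Callable<T> task = dependencyCall::get;
        Future<T> future;
        try {
            future = workers.submit(task);
        } catch (RejectedExecutionException e) {
            return fallback; // pool and queue are full: fail fast
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the worker so it can be reused
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback; // the dependency call itself failed
        }
    }
}
```

Unlike a socket read timeout, the limit here applies to the whole call, whatever the dependency is doing.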
A simple Spring RestController
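    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.beans.factory.annotation.Autowired;
    import org.springframework.web.bind.annotation.RequestMapping;
    import org.springframework.web.bind.annotation.RestController;

    @RestController
    public class Resource {
        private static final Logger LOGGER = LoggerFactory.getLogger(Resource.class);

        @Autowired
        private ScaryDependency scaryDependency;

        @RequestMapping("/scary")
        public String callTheScaryDependency() {
            // Logged on the container (Tomcat) thread that accepted the request.
            LOGGER.info("Resource later: I wonder which thread I am on!");
            return scaryDependency.getScaryString();
        }
    }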
Scary dependency
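    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.stereotype.Component;

    @Component
    public class ScaryDependency {
        private static final Logger LOGGER = LoggerFactory.getLogger(ScaryDependency.class);

        public String getScaryString() {
            LOGGER.info("Scary Dependency: I wonder which thread I am on! Tomcats?");
            if (System.currentTimeMillis() % 2 == 0) {
                return "Scary String";
            } else {
                try {
                    Thread.sleep(5000); // simulate a slow dependency
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
                }
                return "Slow Scary String";
            }
        }
    }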
All on the Tomcat thread
13:47:20.200 [http-8080-exec-1] INFO info.batey.examples.Resource - Resource later: I wonder which thread I am on!
13:47:20.200 [http-8080-exec-1] INFO info.batey.examples.ScaryDependency - Scary Dependency: I wonder which thread I am on! Tomcats?
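Scary dependency
    import com.netflix.hystrix.contrib.javanica.annotation.HystrixCommand;
    import org.slf4j.Logger;
    import org.slf4j.LoggerFactory;
    import org.springframework.stereotype.Component;

    @Component
    public class ScaryDependency {
        private static final Logger LOGGER = LoggerFactory.getLogger(ScaryDependency.class);

        @HystrixCommand // the only change: the method now runs as a Hystrix command
        public String getScaryString() {
            LOGGER.info("Scary Dependency: I wonder which thread I am on! Tomcats?");
            if (System.currentTimeMillis() % 2 == 0) {
                return "Scary String";
            } else {
                try {
                    Thread.sleep(5000); // simulate a slow dependency
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag for the caller
                }
                return "Slow Scary String";
            }
        }
    }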
What an annotation can do...
13:51:21.513 [http-8080-exec-1] INFO info.batey.examples.Resource - Resource later: I wonder which thread I am on!
13:51:21.614 [hystrix-ScaryDependency-1] INFO info.batey.examples.ScaryDependency - Scary Dependency: I wonder which thread I am on! Tomcats? :P
Timeouts take home
● You can’t use network level timeouts for SLAs
● Test your SLAs - if someone says you can’t, hit them with a stick
● Scary things happen without network issues
2 - Don’t try if you can’t succeed
Complexity
“When an application grows in complexity it will eventually start sending emails”
Complexity
“When an application grows in complexity it will eventually start using queues and thread pools”
Or use Akka :)
Don’t try if you can’t succeed
• Executors with unbounded queues :(
  - newFixedThreadPool
  - newSingleThreadExecutor
  - newCachedThreadPool (no queue, but unbounded threads)
• Bound your queues and threads
• Fail quickly when the queue / maxPoolSize is met
• Know your drivers
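One way to bound both the threads and the queue with the plain JDK is to build the `ThreadPoolExecutor` directly instead of going through the `Executors` factory methods. A minimal sketch, with a hypothetical class name; the sizes passed in are for the caller to choose.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class BoundedPools {
    /** Fixed number of threads, bounded queue, immediate rejection once both are full. */
    static ThreadPoolExecutor create(int threads, int queueSize) {
        return new ThreadPoolExecutor(
                threads, threads,                      // core and max pool size are the same
                0L, TimeUnit.MILLISECONDS,             // no idle timeout needed for a fixed-size pool
                new ArrayBlockingQueue<>(queueSize),   // bounded work queue
                new ThreadPoolExecutor.AbortPolicy()); // throw RejectedExecutionException when saturated
    }
}
```

With `AbortPolicy`, a saturated pool rejects new work straight away rather than letting callers queue up behind a dependency that cannot keep up.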
This is a functional requirement
• Set the timeout very high
• Use Wiremock to add a large delay to the requests
• Set queue size and thread pool size to 1
• Send in 2 requests to use the thread and fill the queue
• What happens on the 3rd request?
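The scenario above can be reproduced in miniature without any HTTP at all. In this sketch a `CountDownLatch` stands in for the delayed dependency (an assumption made to keep the example self-contained; the talk uses Wiremock for the delay), and the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

final class SaturationScenario {
    /** Sends three slow requests to a pool with one thread and a queue of one. */
    static List<String> run() {
        ThreadPoolExecutor pool = new ThreadPoolExecutor(
                1, 1, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(1),
                new ThreadPoolExecutor.AbortPolicy());
        CountDownLatch dependencyResponds = new CountDownLatch(1); // stands in for the slow dependency
        List<String> outcomes = new ArrayList<>();
        try {
            for (int request = 1; request <= 3; request++) {
                try {
                    pool.execute(() -> awaitQuietly(dependencyResponds));
                    outcomes.add("accepted");
                } catch (RejectedExecutionException e) {
                    outcomes.add("rejected"); // thread busy and queue full
                }
            }
        } finally {
            dependencyResponds.countDown(); // let the blocked work finish
            pool.shutdown();
        }
        return outcomes;
    }

    private static void awaitQuietly(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

The first request occupies the only thread, the second fills the queue, and the third is rejected immediately, which is the behaviour an acceptance test for this requirement would assert on.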
3 - Fail gracefully
Expect rubbish
• Expect invalid HTTP
• Expect malformed response bodies
• Expect connection failures
• Expect huge / tiny responses
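On the consuming side, "expect rubbish" mostly means validating before trusting. The sketch below parses a pin check style response defensively; the class name and the size limit are hypothetical. It is worth knowing that `Boolean.valueOf(String)` quietly returns `false` for any text other than "true", so malformed bodies are easy to miss without an explicit check.

```java
import java.util.Optional;

final class PinCheckResponseParser {
    // Hypothetical limit: the body is only ever expected to be "true" or "false".
    private static final int MAX_BODY_LENGTH = 16;

    /** Returns the parsed flag, or empty if the response is an error, missing, oversized or malformed. */
    static Optional<Boolean> parse(int statusCode, String body) {
        if (statusCode != 200 || body == null) {
            return Optional.empty();
        }
        if (body.length() > MAX_BODY_LENGTH) {
            return Optional.empty(); // huge response: do not even try to interpret it
        }
        String trimmed = body.trim();
        if (trimmed.equalsIgnoreCase("true")) {
            return Optional.of(Boolean.TRUE);
        }
        if (trimmed.equalsIgnoreCase("false")) {
            return Optional.of(Boolean.FALSE);
        }
        return Optional.empty(); // anything else is rubbish, not "false"
    }
}
```

Returning an empty result forces the caller to decide what a graceful failure looks like instead of carrying on with a guessed value.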
Testing with Wiremock
    stubFor(get(urlEqualTo("/dependencyPath"))
            .willReturn(aResponse()
                    .withFault(Fault.MALFORMED_RESPONSE_CHUNK)));

    { "request": { "method": "GET", "url": "/fault" }, "response": { "fault": "RANDOM_DATA_THEN_CLOSE" } }
    { "request": { "method": "GET", "url": "/fault" }, "response": { "fault": "EMPTY_RESPONSE" } }
Stubbed Cassandra
4 - Know if it’s your fault
Record stuff
• Metrics:
  - Timings
  - Errors
  - Concurrent incoming requests
  - Thread pool statistics
  - Connection pool statistics
• Logging: Boundary logging, ElasticSearch / Logstash
• Request identifiers
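To make the first three metrics concrete, here is a JDK-only stand-in for what a metrics library such as Codahale Metrics provides out of the box. The class and field names are hypothetical; a real service would more likely use the library's timers and counters than roll its own.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.LongAdder;
import java.util.function.Supplier;

final class CallRecorder {
    final LongAdder calls = new LongAdder();          // how many calls finished
    final LongAdder errors = new LongAdder();         // how many of them threw
    final LongAdder totalNanos = new LongAdder();     // total time spent, for a mean timing
    final AtomicInteger inFlight = new AtomicInteger(); // concurrent calls right now

    /** Wraps a call and records its timing, outcome and concurrency. */
    <T> T record(Supplier<T> call) {
        inFlight.incrementAndGet();
        long start = System.nanoTime();
        try {
            return call.get();
        } catch (RuntimeException e) {
            errors.increment();
            throw e; // recording must not swallow the failure
        } finally {
            totalNanos.add(System.nanoTime() - start);
            calls.increment();
            inFlight.decrementAndGet();
        }
    }
}
```

One recorder per dependency is enough to tell whether a slow response time is your own service or something you call.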
Graphite + Codahale
Response times
Separate resource pools
• Don’t flood your dependencies
• Be able to answer the questions:
  - How many connections will you make to dependency X?
  - Are you getting close to your max connections?
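A semaphore per dependency is one simple way to be able to answer both questions. This is a sketch under the assumption that each call holds one connection; the class name is hypothetical and it is not how Hystrix implements its own isolation.

```java
import java.util.Optional;
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

final class DependencyLimiter {
    private final int maxConcurrent;
    private final Semaphore permits;

    DependencyLimiter(int maxConcurrent) {
        this.maxConcurrent = maxConcurrent;
        this.permits = new Semaphore(maxConcurrent);
    }

    /** Runs the call if a permit is free; otherwise returns empty without waiting. */
    <T> Optional<T> tryCall(Supplier<T> call) {
        if (!permits.tryAcquire()) {
            return Optional.empty(); // at the limit: do not flood the dependency
        }
        try {
            return Optional.ofNullable(call.get());
        } finally {
            permits.release();
        }
    }

    /** "How many connections will you make to dependency X?" */
    int max() {
        return maxConcurrent;
    }

    /** "Are you getting close to your max connections?" */
    int inUse() {
        return maxConcurrent - permits.availablePermits();
    }
}
```

Giving each dependency its own limiter means one slow service can exhaust only its own permits, not the capacity the others need.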
So easy with Dropwizard + Hystrix
    metrics:
      reporters:
        - type: graphite
          host: 192.168.10.120
          port: 2003
          prefix: shiny_app
    @Override
    public void initialize(Bootstrap<AppConfig> appConfigBootstrap) {
        HystrixCodaHaleMetricsPublisher metricsPublisher =
                new HystrixCodaHaleMetricsPublisher(appConfigBootstrap.getMetricRegistry());
        HystrixPlugins.getInstance().registerMetricsPublisher(metricsPublisher);
    }
5 - Don’t whack a dead horse
What to do…
• Yes this will happen…
• Mandatory dependency - fail *really* fast
• Throttling
• Fallbacks
Circuit breaker pattern
Implementation with Hystrix
    @Path("integrate")
    public class IntegrationResource {
        private static final Logger LOGGER = LoggerFactory.getLogger(IntegrationResource.class);

        // The userService, deviceService and pinService fields are not shown on the slide.

        @GET
        @Timed
        public String integrate() {
            LOGGER.info("integrate");
            // Each dependency call is wrapped in its own Hystrix command.
            String user = new UserServiceDependency(userService).execute();
            String device = new DeviceServiceDependency(deviceService).execute();
            Boolean pinCheck = new PinCheckDependency(pinService).execute();
            return String.format("[User info: %s] \n[Device info: %s] \n[Pin check: %s] \n", user, device, pinCheck);
        }
    }
Implementation with Hystrix
    import com.netflix.hystrix.HystrixCommand;
    import com.netflix.hystrix.HystrixCommandGroupKey;
    import org.apache.http.HttpResponse;
    import org.apache.http.client.HttpClient;
    import org.apache.http.client.methods.HttpGet;
    import org.apache.http.util.EntityUtils;

    public class PinCheckDependency extends HystrixCommand<Boolean> {
        private HttpClient httpClient;

        public PinCheckDependency(HttpClient httpClient) {
            super(HystrixCommandGroupKey.Factory.asKey("PinCheckService"));
            this.httpClient = httpClient;
        }

        @Override
        protected Boolean run() throws Exception {
            HttpGet pinCheck = new HttpGet("http://localhost:9090/pincheck");
            HttpResponse pinCheckResponse = httpClient.execute(pinCheck);
            int statusCode = pinCheckResponse.getStatusLine().getStatusCode();
            if (statusCode != 200) {
                throw new RuntimeException("Oh dear no pin check, status code " + statusCode);
            }
            String pinCheckInfo = EntityUtils.toString(pinCheckResponse.getEntity());
            return Boolean.valueOf(pinCheckInfo);
        }

        // Used when run() fails, times out, is rejected, or the circuit is open.
        @Override
        public Boolean getFallback() {
            return true;
        }
    }
Triggering the fallback
• Error threshold percentage
• Bucket of time for the percentage
• Minimum number of requests to trigger
• Time before trying a request again
• Disable
• Per instance statistics
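The first four settings above map onto a small state machine. The sketch below is a deliberately simplified, hypothetical circuit breaker and not Hystrix's implementation: it uses one fixed statistics window rather than rolling buckets, and it does not guard against calls that were already in flight when the circuit opened. The clock is injected so the behaviour can be tested without sleeping.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    private final int errorThresholdPercent; // open once this share of calls has failed
    private final int minimumRequests;       // but only after this many calls in the window
    private final long windowMillis;         // bucket of time the percentage is measured over
    private final long sleepWindowMillis;    // time before trying a request again
    private final LongSupplier clock;        // injectable so tests can control time

    private long windowStart;
    private int requests;
    private int failures;
    private boolean open;
    private long openedAt;

    SimpleCircuitBreaker(int errorThresholdPercent, int minimumRequests,
                         long windowMillis, long sleepWindowMillis, LongSupplier clock) {
        this.errorThresholdPercent = errorThresholdPercent;
        this.minimumRequests = minimumRequests;
        this.windowMillis = windowMillis;
        this.sleepWindowMillis = sleepWindowMillis;
        this.clock = clock;
        this.windowStart = clock.getAsLong();
    }

    /** Calls the primary unless the circuit is open; falls back on rejection or failure. */
    <T> T call(Supplier<T> primary, Supplier<T> fallback) {
        if (!allowRequest()) {
            return fallback.get(); // circuit open: don't whack the dead horse
        }
        try {
            T result = primary.get();
            record(true);
            return result;
        } catch (RuntimeException e) {
            record(false);
            return fallback.get();
        }
    }

    synchronized boolean isOpen() {
        return open;
    }

    private synchronized boolean allowRequest() {
        if (!open) {
            return true;
        }
        long now = clock.getAsLong();
        if (now - openedAt >= sleepWindowMillis) {
            openedAt = now; // let one trial call through, keep rejecting the rest for now
            return true;
        }
        return false;
    }

    private synchronized void record(boolean success) {
        long now = clock.getAsLong();
        if (open) {
            if (success) {
                open = false; // trial call worked: close the circuit and start afresh
                resetWindow(now);
            } else {
                openedAt = now; // trial call failed: wait another sleep window
            }
            return;
        }
        if (now - windowStart >= windowMillis) {
            resetWindow(now);
        }
        requests++;
        if (!success) {
            failures++;
        }
        if (requests >= minimumRequests && failures * 100 >= errorThresholdPercent * requests) {
            open = true;
            openedAt = now;
        }
    }

    private void resetWindow(long now) {
        windowStart = now;
        requests = 0;
        failures = 0;
    }
}
```

The minimum request count matters more than it looks: without it, a single failed call on a quiet service is a 100% error rate and would open the circuit.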
6 - Turn off broken stuff
• The kill switch
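At its simplest, a kill switch is a flag that operators can flip at runtime. A minimal sketch with hypothetical names; how the flag gets flipped (admin endpoint, config refresh, JMX) is left out.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Supplier;

final class KillSwitch {
    private final AtomicBoolean disabled = new AtomicBoolean(false);

    void turnOff() {
        disabled.set(true);
    }

    void turnOn() {
        disabled.set(false);
    }

    /** Runs the feature unless it has been switched off, in which case whenOff is returned. */
    <T> T call(Supplier<T> feature, T whenOff) {
        return disabled.get() ? whenOff : feature.get();
    }
}
```

The value of the switch is that it needs no deploy: a broken dependency can be taken out of the request path while the rest of the service keeps working.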
To recap
1. Don’t take forever - Timeouts
2. Don’t try if you can’t succeed
3. Fail gracefully
4. Know if it’s your fault
5. Don’t whack a dead horse
6. Turn broken stuff off
Links
• Examples:
  - https://github.com/chbatey/spring-cloud-example
  - https://github.com/chbatey/dropwizard-hystrix
  - https://github.com/chbatey/vagrant-wiremock-saboteur
• Tech:
  - https://github.com/Netflix/Hystrix
  - https://www.vagrantup.com/
  - http://wiremock.org/
  - https://github.com/tomakehurst/saboteur
Questions?
Thanks for listening!
Questions: @chbatey
http://christopher-batey.blogspot.co.uk/
Developer takeaways
● Learn about TCP
● Love vagrant, docker etc to enable testing
● Don’t trust libraries
Hystrix cost - do this yourself
Hystrix metrics
● Failure count
● Percentiles from Hystrix point of view
● Error percentages
How to test metric publishing?
● Stub out graphite and verify calls?
● Programmatically call graphite and verify numbers?
● Make metrics + logs part of the story demo
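The first option, stubbing out Graphite, is feasible because Graphite's plaintext protocol is just lines of "path value timestamp" over TCP. The sketch below is a hypothetical test helper, not part of the talk's repositories: it listens on a free local port and captures whatever lines arrive. The `sendPlaintext` method only stands in for a real metrics reporter so the example is self-contained.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

final class StubGraphite implements AutoCloseable {
    private final ServerSocket server = openOnFreePort();
    private final BlockingQueue<String> received = new LinkedBlockingQueue<>();

    StubGraphite() {
        Thread acceptor = new Thread(this::acceptLoop, "stub-graphite");
        acceptor.setDaemon(true);
        acceptor.start();
    }

    /** Port to configure as the Graphite port in the service under test. */
    int port() {
        return server.getLocalPort();
    }

    /** Waits up to timeoutMillis for the next metric line; returns null if none arrives. */
    String nextLine(long timeoutMillis) {
        try {
            return received.poll(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return null;
        }
    }

    /** Stand-in for a real reporter: sends one plaintext line to a local port. */
    static void sendPlaintext(int port, String line) {
        try (Socket socket = new Socket(InetAddress.getLoopbackAddress(), port);
             Writer out = new OutputStreamWriter(socket.getOutputStream(), StandardCharsets.UTF_8)) {
            out.write(line + "\n");
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private void acceptLoop() {
        while (!server.isClosed()) {
            try (Socket client = server.accept();
                 BufferedReader in = new BufferedReader(
                         new InputStreamReader(client.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    received.add(line);
                }
            } catch (IOException e) {
                // Server closed or client went away; the loop condition decides whether to continue.
            }
        }
    }

    private static ServerSocket openOnFreePort() {
        try {
            return new ServerSocket(0);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    @Override
    public void close() {
        try {
            server.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

An acceptance test would point the service's Graphite reporter at `port()`, drive some traffic, and then assert on the metric names and values that show up in `nextLine`.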