Break Me If You Can: Practical Guide to Building Fault-tolerant Systems
O'Reilly Open Source Conference, Portland OR, July 18, 2019
Alex Borysov, Software Engineer @ Netflix
Mykyta Protsenko, Software Engineer @ Netflix


Who are we?

Alex Borysov

Software Engineer @ Netflix

Mykyta Protsenko

Software Engineer @ Netflix

@aiborisov @mykyta_p

@WeAreNetflix

Fault-Tolerance?

Fault vs Error vs Failure

Fault

incorrect internal state

Picture by Bob McMillan. Public domain. See slide #135 for details.

Error

visibly incorrect behaviour

Picture by David Goehring. CC BY 2.0. See slide #135 for details.

Failure

main functionality is broken

Picture by Camerafiend. CC BY-SA 3.0. See slide #135 for details.

RMS Titanic vs Miracle on the Hudson

Willy Stöwer. Public domain. See slide #135 for details.
By Greg Lam Pak Ng. CC BY 2.0. See slide #136 for details.

RMS Titanic

Fault: Hitting an iceberg

Error: Water in the hull

Failure: Sinking

Willy Stöwer. Public domain. See slide #135 for details.

Miracle on the Hudson

Fault: Hitting geese at 2818 ft

Error: Engines shut down

No Failure!

By Greg Lam Pak Ng. CC BY 2.0. See slide #136 for details.

Fault → Error → Failure

Fault Tolerance

Code and Design Patterns

Product-Driven Decisions

Communication

By Greg Lam Pak Ng. CC BY 2.0. See slide #136 for details.

Dodging Geese

Dodging Geese Architecture

TOP-5

Geese Service

Clouds Service

Leaderboard Service

API Gateway

See slides ##135, 136 for licensing details.

Leaderboard API (REST)

/players/<username>/score

{"name": "Jane", "score": 100}

/leaderboard/top/<n>

[{"name": "Jane", "score": 100}, {"name": "John", "score": 50}, ...]

gRPC Service Definitions

service GeeseService {
  // Return next line of geese.
  rpc GetGeese (GetGeeseRequest) returns (GeeseResponse);
}

service CloudsService {
  // Return next line of clouds.
  rpc GetClouds (GetCloudsRequest) returns (CloudsResponse);
}

gRPC Gateway Service

service FixtureService {
  // Return next line of geese and clouds.
  rpc GetFixture (GetFixtureRequest) returns (FixtureResponse);
}

Geese + Clouds = Fixture

Gateway Fixture Service

public class FixtureService extends FixtureServiceImplBase { ... }

API Gateway

Geese Service

Clouds Service

Non-Blocking Calls

Don’t block

Send requests in parallel

Combine results when ready
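The three points above can be sketched with the JDK's `CompletableFuture`. This is only an illustration: the deck itself uses gRPC future stubs with Guava's `ListenableFuture`, and `fetchGeese` and `fetchClouds` are hypothetical stand-ins for the remote calls.

```java
import java.util.concurrent.CompletableFuture;

final class ParallelCalls {
    // Hypothetical stand-ins for the remote Geese and Clouds calls.
    static CompletableFuture<String> fetchGeese() {
        return CompletableFuture.supplyAsync(() -> "geese");
    }

    static CompletableFuture<String> fetchClouds() {
        return CompletableFuture.supplyAsync(() -> "clouds");
    }

    // Starts both calls before waiting on either, then combines the
    // results once both have completed.
    static CompletableFuture<String> fetchFixture() {
        CompletableFuture<String> geese = fetchGeese();
        CompletableFuture<String> clouds = fetchClouds();
        return geese.thenCombine(clouds, (g, c) -> g + "+" + c);
    }

    public static void main(String[] args) {
        // join() blocks only here, at the edge of the demo.
        System.out.println(fetchFixture().join()); // prints geese+clouds
    }
}
```

Because both futures are created before either is awaited, the total latency is roughly that of the slower call rather than the sum of the two.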

Gateway Service Implementation

public class FixtureService extends FixtureServiceImplBase {
  private final GeeseServiceFutureStub geeseClient = ...;
  private final CloudsServiceFutureStub cloudsClient = ...;

  @Override
  public void getFixture(GetFixtureRequest request,
                         StreamObserver<FixtureResponse> response) {
    ListenableFuture<GeeseResponse> geese =
        geeseClient.getGeese(toGeese(request));
    ListenableFuture<CloudsResponse> clouds =
        cloudsClient.getClouds(toClouds(request));

    ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds =
        Futures.allAsList(geese, clouds);
    ...
  }
}

Slow dependencies

Slow upstream services

Timeouts

Guaranteed latency for integration points
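A minimal sketch of a timeout at an integration point, using the JDK's `CompletableFuture.orTimeout` (Java 9+) instead of the gRPC and WebFlux APIs the deck uses. `slowCall` is a hypothetical dependency and the delay values are illustrative.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CompletionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class TimeoutDemo {
    // Hypothetical dependency that answers after delayMillis.
    static CompletableFuture<String> slowCall(long delayMillis) {
        return CompletableFuture.supplyAsync(
            () -> "clouds",
            CompletableFuture.delayedExecutor(delayMillis, TimeUnit.MILLISECONDS));
    }

    // Bounds the wait: the future fails with TimeoutException if the
    // dependency has not answered within timeoutMillis.
    static CompletableFuture<String> callWithTimeout(long delayMillis, long timeoutMillis) {
        return slowCall(delayMillis).orTimeout(timeoutMillis, TimeUnit.MILLISECONDS);
    }

    // Reports whether the call ended because the timeout fired.
    static boolean timedOut(CompletableFuture<String> call) {
        try {
            call.join();
            return false;
        } catch (CompletionException e) {
            return e.getCause() instanceof TimeoutException;
        }
    }

    public static void main(String[] args) {
        System.out.println(timedOut(callWithTimeout(1_000, 50))); // slow dependency
        System.out.println(timedOut(callWithTimeout(10, 1_000))); // fast dependency
    }
}
```

Note that `orTimeout` only bounds how long the caller waits. It does not stop the work already running on the other side.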

Gateway Service Implementation

public class FixtureService extends FixtureServiceImplBase {
  ...
  @Override
  public void getFixture(GetFixtureRequest request,
                         StreamObserver<FixtureResponse> response) {
    ListenableFuture<GeeseResponse> geese = geeseClient
        .withDeadlineAfter(500, MILLISECONDS)
        .getGeese(toGeeseRequest(request));

    ListenableFuture<CloudsResponse> clouds = cloudsClient
        .withDeadlineAfter(500, MILLISECONDS)
        .getClouds(toCloudsRequest(request));

    ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds =
        Futures.allAsList(geese, clouds);
    ...
  }
}

REST: Non-Blocking Calls with Timeout

CompletableFuture<List<LeaderboardEntry>> leaderboard = httpClient
    .get().uri("/top/5")
    .exchange()
    .timeout(Duration.ofMillis(500))
    .flatMap(cr -> cr.bodyToMono(...))
    .toFuture();

Demo


No Geese

No Clouds

Blinking Leaderboard

Observability

Monitoring: QPS, latency, errors, ...

Observability: gRPC

// OpenCensus
RpcViews.registerAllViews();

Tracing: gRPC

GrpcTracing grpcTracing = GrpcTracing.create(...);

ManagedChannelBuilder
    ...
    .intercept(grpcTracing.newClientInterceptor())
    .build();

ServerBuilder.forPort(8080)
    ...
    .intercept(grpcTracing.newServerInterceptor())
    .build();

Tracing: REST

build.gradle:
dependencies {
  compile '...:spring-cloud-sleuth-zipkin'
  compile '...:spring-cloud-starter-sleuth'
  ...
}

application.properties:
spring.zipkin.baseUrl=http://zipkin:9411/
spring.sleuth.sampler.probability=1.0
spring.sleuth.web.enabled=true

Demo

Clouds are slow

Geese are fast

Entire call fails

Partial Degradation

ListenableFuture<GeeseResponse> geese = geeseClient.getGeese(toGeese(request));
ListenableFuture<CloudsResponse> clouds = cloudsClient.getClouds(toClouds(request));

// Before: the combined future fails if either call fails.
ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds =
    Futures.allAsList(geese, clouds);

// After: a failed call becomes a null entry and the other result still gets through.
ListenableFuture<List<GeneratedMessageV3>> geeseAndClouds =
    Futures.successfulAsList(geese, clouds);

...
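The same idea can be sketched without Guava. The helper below is a hypothetical JDK-only analogue of `Futures.successfulAsList`, written for two string results: a failed call contributes `null` instead of failing the whole batch.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;

final class PartialDegradation {
    // Waits for both calls; a failed call is replaced by null so that
    // the other result is still delivered.
    static CompletableFuture<List<String>> successfulAsList(
            CompletableFuture<String> first, CompletableFuture<String> second) {
        CompletableFuture<String> a = first.exceptionally(error -> null);
        CompletableFuture<String> b = second.exceptionally(error -> null);
        // Arrays.asList is used because it accepts null elements.
        return a.thenCombine(b, (x, y) -> Arrays.asList(x, y));
    }

    public static void main(String[] args) {
        CompletableFuture<String> geese = CompletableFuture.completedFuture("geese");
        CompletableFuture<String> clouds =
            CompletableFuture.failedFuture(new RuntimeException("clouds timed out"));
        System.out.println(successfulAsList(geese, clouds).join()); // prints [geese, null]
    }
}
```

The caller has to check for `null` entries and decide what a degraded response looks like, for example geese without clouds.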

Some Leaderboard calls fail

Leaderboard latency is low

Scores disappear

Retries: REST

CompletableFuture<List<Leaderboard>> request() {
  return httpClient
      .get().uri("/top/5").exchange()
      .timeout(Duration.ofMillis(500))
      .flatMap(...).toFuture();
}

RetryPolicy RETRY_POLICY = new RetryPolicy()
    .retryOn(IOException.class)
    .withMaxRetries(MAX_RETRIES);

CompletableFuture<List<Leaderboard>> top5 = Failsafe.with(RETRY_POLICY)
    ...
    .future(this::request);
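What a retry policy of this kind does can be sketched in plain Java. This is not Failsafe's implementation, just a hand-rolled loop under simple assumptions: the call is synchronous, and `UncheckedIOException` stands in for the `IOException` that the policy above retries on.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class SimpleRetry {
    // Runs the call, retrying only on UncheckedIOException, for at most
    // maxRetries extra attempts. Other exceptions propagate immediately.
    static <T> T callWithRetries(Supplier<T> call, int maxRetries) {
        for (int attempt = 0; ; attempt++) {
            try {
                return call.get();
            } catch (UncheckedIOException e) {
                if (attempt >= maxRetries) {
                    throw e; // retries exhausted
                }
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        // Hypothetical flaky call: fails twice, then succeeds.
        String result = callWithRetries(() -> {
            if (calls.incrementAndGet() < 3) {
                throw new UncheckedIOException(new IOException("connection reset"));
            }
            return "top-5";
        }, 3);
        System.out.println(result + " after " + calls.get() + " attempts"); // top-5 after 3 attempts
    }
}
```

Retrying only on a specific exception type matters: errors that will not go away on their own should fail immediately.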

Demo

Retry slow calls?

Retry failed calls?

Retry network faults?

Retry Storm

API Gateway

Clouds Service

Exponential Backoffs

new RetryPolicy()
    .withBackoff(MIN_DELAY, MAX_DELAY, TimeUnit.MILLISECONDS, 100.0)
    ...
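How the delay grows under exponential backoff can be sketched as a small function. The formula and the sample values (100 ms minimum, 2 s maximum, factor 2) are illustrative assumptions, not the deck's configuration.

```java
final class Backoff {
    // Delay before retry number `attempt` (0-based):
    // minDelay * factor^attempt, capped at maxDelay.
    static long delayMillis(int attempt, long minDelayMillis, long maxDelayMillis, double factor) {
        double delay = minDelayMillis * Math.pow(factor, attempt);
        return (long) Math.min(delay, (double) maxDelayMillis);
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println(delayMillis(attempt, 100, 2_000, 2.0));
        }
        // prints 100, 200, 400, 800, 1600, 2000 (one value per line)
    }
}
```

Spacing retries out this way gives an overloaded service time to recover instead of hitting it with immediate repeat traffic.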

Fallbacks

CircuitBreaker CIRCUIT_BREAKER = new CircuitBreaker()
    .withFailureThreshold(...);

// Fall back to an empty leaderboard
Failsafe
    .with(CIRCUIT_BREAKER)
    .withFallback(() -> emptyLeaderboard())
    ...

// or to a cached leaderboard
Failsafe
    .with(CIRCUIT_BREAKER)
    .withFallback(() -> cachedLeaderboard())
    ...
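A rough sketch of what a circuit breaker with a fallback does internally. This is not Failsafe's `CircuitBreaker`. The class, its parameters and the injected clock are assumptions made for illustration, and the sketch holds a lock during the call, which a production breaker would avoid.

```java
import java.util.function.LongSupplier;
import java.util.function.Supplier;

final class SimpleCircuitBreaker {
    private final int failureThreshold;
    private final long openMillis;
    private final LongSupplier clockMillis; // injected so tests can control time
    private int consecutiveFailures = 0;
    private long openedAtMillis = -1;       // -1 means the circuit is closed

    SimpleCircuitBreaker(int failureThreshold, long openMillis, LongSupplier clockMillis) {
        this.failureThreshold = failureThreshold;
        this.openMillis = openMillis;
        this.clockMillis = clockMillis;
    }

    // Calls the dependency unless the circuit is open. While open, or when
    // the call fails, the fallback value is returned instead.
    synchronized <T> T call(Supplier<T> dependency, Supplier<T> fallback) {
        boolean open = openedAtMillis >= 0
            && clockMillis.getAsLong() - openedAtMillis < openMillis;
        if (open) {
            return fallback.get(); // fail fast: do not touch the dependency
        }
        try {
            T result = dependency.get(); // closed, or trial call after the open period
            consecutiveFailures = 0;
            openedAtMillis = -1;
            return result;
        } catch (RuntimeException e) {
            consecutiveFailures++;
            if (consecutiveFailures >= failureThreshold) {
                openedAtMillis = clockMillis.getAsLong(); // open (or re-open) the circuit
            }
            return fallback.get();
        }
    }

    public static void main(String[] args) {
        SimpleCircuitBreaker breaker =
            new SimpleCircuitBreaker(2, 1_000, System::currentTimeMillis);
        Supplier<String> failing = () -> { throw new IllegalStateException("leaderboard down"); };
        for (int i = 0; i < 3; i++) {
            System.out.println(breaker.call(failing, () -> "empty leaderboard"));
        }
    }
}
```

After two consecutive failures the third call never reaches the dependency, which protects a struggling service from extra load while users still get a degraded answer.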

On Error

Retry

Fallback

Fail Fast

High 99%ile latency

100 requests

Error probability (at least one of the 100 requests lands in the slow 1% tail):

1 - 0.99^100 ≈ 63%
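The arithmetic can be checked with a few lines of Java. The sketch assumes the requests are independent, which is a simplification.

```java
final class TailProbability {
    // Probability that at least one of n independent calls is slower than
    // the given latency percentile (0.99 for the 99th percentile).
    static double atLeastOneSlow(double percentile, int n) {
        return 1.0 - Math.pow(percentile, n);
    }

    public static void main(String[] args) {
        for (int n : new int[] {1, 10, 100}) {
            System.out.println(n + " calls: " + atLeastOneSlow(0.99, n));
        }
    }
}
```

With a fan-out of 100, tail latency stops being a rare event: roughly 63% of user requests see at least one slow call.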

Tail-Tolerance

Request, 200 ms deadline

↓ 100 ms

Request

Fastest Response
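The flow above can be sketched with the JDK's `CompletableFuture`: send the request, wait for the hedge delay, send a second copy if there is still no answer, and take whichever response arrives first. The helper names are hypothetical, and the sketch treats the first completion as the winner even if it is a failure.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

final class Hedging {
    // Sends the primary request. If it is still pending after
    // hedgeDelayMillis, sends a backup and returns the first completion.
    static <T> CompletableFuture<T> hedged(Supplier<CompletableFuture<T>> request,
                                           long hedgeDelayMillis) {
        CompletableFuture<T> primary = request.get();
        CompletableFuture<T> backup = new CompletableFuture<>();
        CompletableFuture.delayedExecutor(hedgeDelayMillis, TimeUnit.MILLISECONDS)
            .execute(() -> {
                if (!primary.isDone()) {
                    request.get().whenComplete((value, error) -> {
                        if (error == null) {
                            backup.complete(value);
                        } else {
                            backup.completeExceptionally(error);
                        }
                    });
                }
            });
        return primary.applyToEither(backup, value -> value);
    }

    // Hypothetical dependency: the first attempt is slow, later ones are fast.
    static Supplier<CompletableFuture<String>> slowThenFast() {
        AtomicInteger attempts = new AtomicInteger();
        return () -> {
            int n = attempts.incrementAndGet();
            long latencyMillis = (n == 1) ? 1_000 : 10;
            return CompletableFuture.supplyAsync(
                () -> "attempt " + n,
                CompletableFuture.delayedExecutor(latencyMillis, TimeUnit.MILLISECONDS));
        };
    }

    public static void main(String[] args) {
        // The backup is sent after 100 ms and answers long before the slow primary.
        System.out.println(hedged(slowThenFast(), 100).join()); // prints attempt 2
    }
}
```

Hedging costs extra load, so it only suits requests that are safe to execute twice, and the hedge delay is typically chosen so that only a small share of calls triggers a backup.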

Request Hedging

High 99%ile latency

100 requests

Error probability: 63% × 0.01 < 1%
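A slightly more explicit way to reach the same "under 1%" bound, as a sketch. It assumes that every call gets a backup and that the two attempts are independent, so a call is slow only if both attempts land in the 1% tail.

```java
final class HedgedTailProbability {
    // A hedged call is slow only if both the primary and the backup are slow.
    static double slowWithHedge(double tailFraction) {
        return tailFraction * tailFraction;
    }

    // Probability that at least one of n independent calls is slow.
    static double atLeastOneSlow(double perCallSlow, int n) {
        return 1.0 - Math.pow(1.0 - perCallSlow, n);
    }

    public static void main(String[] args) {
        System.out.println("no hedging:   " + atLeastOneSlow(0.01, 100));
        System.out.println("with hedging: " + atLeastOneSlow(slowWithHedge(0.01), 100));
    }
}
```

Under these assumptions the per-call tail drops from 1% to 0.01%, and the chance of a slow fan-out of 100 calls drops from about 63% to just under 1%.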

Hedging in gRPC (soon)

Channel geeseChannel = ManagedChannelBuilder
    .forAddress(geeseHost, geesePort)
    .enableRetry()
    .maxHedgedAttempts(MAX_HEDGES)
    .build();

GeeseServiceFutureStub geeseStub = GeeseServiceGrpc
    .newFutureStub(geeseChannel);

Error Handling

Consistent faults: Fail Fast

Intermittent, slow: Hedging

Intermittent, fast: Retry

+ Fallback


Client-driven deadline

Don’t process failed calls

Deadlines

API Gateway

Deadline 200 ms

→ Spent 120 ms → Spent 90 ms

X (210 ms spent in total, past the 200 ms deadline)

See slides ##135, 136 for licensing details.

Deadline Propagation

APIGateway

Deadline 200 ms → Spent 120 ms

Deadline 80 ms → Spent 90 ms

Deadline -10 ms

X
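The arithmetic on these slides can be sketched in a few lines of Java. This is a minimal illustration, not the speakers' demo code: the `Deadline` class and its method names are hypothetical, and the clock is simulated so that the numbers match the slides (200 ms budget, 120 ms spent by the first service, 90 ms by the second).

```java
import java.time.Duration;

public class DeadlinePropagation {

    /** Hypothetical deadline type: an absolute expiry time on a monotonic clock, in nanoseconds. */
    static final class Deadline {
        private final long expiresAtNanos;

        private Deadline(long expiresAtNanos) {
            this.expiresAtNanos = expiresAtNanos;
        }

        /** Creates a deadline that expires `budget` after `nowNanos`. */
        static Deadline after(Duration budget, long nowNanos) {
            return new Deadline(nowNanos + budget.toNanos());
        }

        /** Budget left at `nowNanos`; negative once the deadline has passed. */
        Duration remaining(long nowNanos) {
            return Duration.ofNanos(expiresAtNanos - nowNanos);
        }

        /** True when no budget is left, so further work would be wasted. */
        boolean isExpired(long nowNanos) {
            return expiresAtNanos - nowNanos <= 0;
        }
    }

    public static void main(String[] args) {
        long now = 0L; // simulated clock; real code would read System.nanoTime()

        // The gateway starts the request with a 200 ms budget.
        Deadline deadline = Deadline.after(Duration.ofMillis(200), now);

        // The first service spends 120 ms, then passes on what is left.
        now += Duration.ofMillis(120).toNanos();
        System.out.println("remaining after first hop: "
                + deadline.remaining(now).toMillis() + " ms"); // 80 ms

        // The second service spends 90 ms and overruns the propagated deadline.
        now += Duration.ofMillis(90).toNanos();
        System.out.println("remaining after second hop: "
                + deadline.remaining(now).toMillis() + " ms"); // -10 ms
        System.out.println("expired: " + deadline.isExpired(now)); // true
    }
}
```

Because the second service sees only the remaining 80 ms, it can stop at its own deadline instead of finishing work whose result the gateway has already given up on.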

Throughput has limits

Exceeding limits?

REST

new ConcurrencyLimitServletFilter(
    new ServletLimiterBuilder()
        .partitionByHeader("GEESE_TYPE", c -> c
            .assign("premium", 0.9)
            .assign("free", 0.1))
        .limiter(l -> l.limit(
            newBuilder().initialLimit(1000)...);

gRPC: Server

var limiter = new GrpcServerLimiterBuilder()
    .partitionByHeader(GEESE_TYPE)
    .partition("premium", 0.9)
    .partition("free", 0.1)
    .limiter(l -> l.limit(
        newBuilder().initialLimit(1000)...);

ConcurrencyLimitServerInterceptor
    .newBuilder(limiter).build();
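To make the idea of partitioning a concurrency limit concrete, here is a simplified sketch that uses only `java.util.concurrent.Semaphore`. It is an assumption-laden illustration rather than the library's implementation: the limit here is fixed, each partition's share is treated as a hard cap, and the `PartitionedLimiter` class and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.Semaphore;

public class PartitionedLimiter {
    // One semaphore per partition, sized as a share of the total limit.
    private final Map<String, Semaphore> partitions = new HashMap<>();

    PartitionedLimiter(int totalLimit, Map<String, Double> shares) {
        shares.forEach((name, share) ->
                partitions.put(name, new Semaphore((int) Math.round(totalLimit * share))));
    }

    /** Returns true if the request is admitted; the caller must then call release(). */
    boolean tryAcquire(String partition) {
        Semaphore permits = partitions.get(partition);
        return permits != null && permits.tryAcquire(); // unknown partitions are rejected
    }

    /** Frees the slot taken by a previously admitted request. */
    void release(String partition) {
        partitions.get(partition).release();
    }

    public static void main(String[] args) {
        // Total limit of 10 in-flight requests: 9 slots for "premium", 1 for "free".
        PartitionedLimiter limiter =
                new PartitionedLimiter(10, Map.of("premium", 0.9, "free", 0.1));

        System.out.println(limiter.tryAcquire("free"));    // true: the single free slot
        System.out.println(limiter.tryAcquire("free"));    // false: free traffic is shed
        System.out.println(limiter.tryAcquire("premium")); // true: premium is unaffected
    }
}
```

The point of the split is that a burst of "free" traffic exhausts only its own slots, so "premium" requests keep being admitted.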

gRPC: Client

new GrpcClientLimiterBuilder()
    .limit(
        newBuilder()
            .initialLimit(1000).build())
    .blockOnLimit(false) // fail-fast
    .build();
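The difference between blocking and failing fast on the client side can be shown with a plain semaphore. This is a sketch under stated assumptions, not the library's behavior in detail: `ClientLimiter`, its constructor flag, and the exception type are hypothetical, and the limit is fixed.

```java
import java.util.concurrent.Semaphore;
import java.util.function.Supplier;

public class ClientLimiter {
    private final Semaphore permits;
    private final boolean blockOnLimit;

    ClientLimiter(int limit, boolean blockOnLimit) {
        this.permits = new Semaphore(limit);
        this.blockOnLimit = blockOnLimit;
    }

    /** Runs the request if a slot is available, otherwise waits or fails fast. */
    <T> T call(Supplier<T> request) {
        if (blockOnLimit) {
            permits.acquireUninterruptibly(); // wait until a slot frees up
        } else if (!permits.tryAcquire()) {
            // Fail fast: reject immediately instead of queueing behind slow calls.
            throw new IllegalStateException("concurrency limit reached");
        }
        try {
            return request.get();
        } finally {
            permits.release();
        }
    }

    public static void main(String[] args) {
        ClientLimiter limiter = new ClientLimiter(1, false);

        // The outer call holds the only slot, so the nested call is rejected.
        String outcome = limiter.call(() -> {
            try {
                return limiter.call(() -> "admitted");
            } catch (IllegalStateException e) {
                return "rejected";
            }
        });
        System.out.println(outcome); // rejected

        // Once the slot is released, the next call goes through.
        System.out.println(limiter.call(() -> "admitted")); // admitted
    }
}
```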

Demo

Monitoring

APM

Service metrics

Distributed tracing

Business metrics

Picture by Alex Borysov. CC BY 2.0. See slide #136 for details.

Code and Design

Timeouts / Deadline Propagation

Retries / Hedging

Proper Fallbacks

Concurrency Limits

Load Shedding

Observability

Demo

Bad user experience

Metrics are not enough

Prober

TOP-5

APIGateway

See slides ##135, 137 for licensing details.

Prober

Availability

Latency SLO

Response verification

CloudProber.org
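The three checks listed for a prober can be sketched as one small function. This is a hypothetical illustration, not Cloudprober's API: the `Prober` class, the `Result` enum, and the stand-in for the TOP-5 call are assumptions, and a real prober would call the gateway over the network on a schedule.

```java
import java.time.Duration;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.Supplier;

public class Prober {
    enum Result { OK, UNAVAILABLE, TOO_SLOW, BAD_RESPONSE }

    /** Runs one probe: availability first, then latency SLO, then response verification. */
    static <T> Result probe(Supplier<T> call, Duration latencySlo, Predicate<T> verifier) {
        long start = System.nanoTime();
        T response;
        try {
            response = call.get();
        } catch (RuntimeException e) {
            return Result.UNAVAILABLE; // the call failed outright
        }
        Duration elapsed = Duration.ofNanos(System.nanoTime() - start);
        if (elapsed.compareTo(latencySlo) > 0) {
            return Result.TOO_SLOW; // answered, but outside the latency SLO
        }
        // Answered in time; make sure the answer is actually correct.
        return verifier.test(response) ? Result.OK : Result.BAD_RESPONSE;
    }

    public static void main(String[] args) {
        // Stand-in for a TOP-5 request to the gateway (hypothetical data).
        Supplier<List<String>> top5 = () -> List.of("a", "b", "c", "d", "e");
        Supplier<List<String>> top4 = () -> List.of("a", "b", "c", "d");

        System.out.println(probe(top5, Duration.ofSeconds(1), r -> r.size() == 5)); // OK
        System.out.println(probe(top4, Duration.ofSeconds(1), r -> r.size() == 5)); // BAD_RESPONSE
    }
}
```

Response verification is what separates a prober from a plain health check: a service can answer quickly with a 200 and still return the wrong data.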

Technical solutions are not enough

Communication

Communication Channels

GEESE at 270

Postmortems

Blameless

Constructive

Social

Postmortems

Timeline

Causes

Remedies

See slides ##135, 137 for licensing details.

Learn from Failure

Blameless postmortems

Alert playbooks

Incident knowledge base

Libraries and Tools

Demo: github.com/break-me-if-you-can

Failsafe: github.com/jhalterman/failsafe

Observability: opencensus.io + opentracing.io = opentelemetry.io

Prober: cloudprober.org

Concurrency Limits: github.com/Netflix/concurrency-limits

Demo UI

Yevgen Golubenko

Twitter: @HalloGene_

github.com/HalloGene

Picture by Yevgen Golubenko. Also see slide #138 for licensing details.

Books

Fault-Tolerance

Code & Design Patterns

Product decisions

Communication culture

Please Rate Our Talk

Conference Website

O’Reilly Events App

5 STARS!

See slides ##135, 137 for licensing details.

Images and Licensing

Images of geese, clouds, pilots, plane, arrows, cup, airport traffic control tower are property of Mykyta Protsenko and Alex Borysov, if not stated otherwise (see below). All Rights Reserved.

Other images used:

Slide #5: commons.wikimedia.org/wiki/File:FEMA_-_16381_-_Photograph_by_Bob_McMillan_taken_on_09-28-2005_in_Texas.jpg - Picture by Bob McMillan, the US federal government work, public domain

Slide #6: www.flickr.com/photos/carbonnyc/3290528875 - Picture by David Goehring. Attribution 2.0 Generic (CC BY 2.0): creativecommons.org/licenses/by/2.0 - changes were made

Slide #7: www.flickr.com/photos/carbonnyc/3290528875 - Picture by Camerafiend. Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0): creativecommons.org/licenses/by-sa/3.0/deed.en - no changes were made

Slides ##8, 9: commons.wikimedia.org/wiki/File:Titanic_sinking,_painting_by_Willy_St%C3%B6wer.jpg - Willy Stöwer. Public domain work of art

Slides ##8, 10, 13: www.flickr.com/photos/22608787@N00/3200086900 - Picture by Greg Lam Pak Ng. Attribution 2.0 Generic (CC BY 2.0): creativecommons.org/licenses/by/2.0 - no changes were made

Slides ##15-22, 29, 62, 71-74, 83-91, 106-107: - Blue Game Boy Color by kure: piq.codeus.net/picture/31994/Blue-Game-Boy-Color - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made

Slides ##83-91: - The Sun by Vinicius615: piq.codeus.net/picture/191706/The-Sun - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made

Slide #102: - Picture by Alex Borysov. Attribution 2.0 Generic (CC BY 2.0): creativecommons.org/licenses/by/2.0

Slides #107, ##131-133: piq.codeus.net/picture/423109/UFO - UFO by anonymous - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - no changes were made

Slides ##121, 122: piq.codeus.net/picture/334023/beer - beer by Investa - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made

Slides ##121, 122: piq.codeus.net/picture/444498/Beer-Bottle - Beer Bottle by jacklrj - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made

Slide #126: https://piq.codeus.net/picture/330338/Deal-With-It - Deal With It by Shiro - Attribution 3.0 Unported (CC BY 3.0): creativecommons.org/licenses/by/3.0 - changes were made