tuning speculative retries to fight latency (michael figuiere, minh do, netflix) | cassandra summit...

Tuning Speculative Retries to Fight Latency Minh Do - Senior Distributed System Engineer Michaël Figuière - Senior Distributed System Engineer

Upload: datastax

Post on 16-Apr-2017


Page 1: Tuning Speculative Retries to Fight Latency (Michael Figuiere, Minh Do, Netflix) | Cassandra Summit 2016

Tuning Speculative Retries to Fight Latency

Minh Do - Senior Distributed System Engineer
Michaël Figuière - Senior Distributed System Engineer

Agenda

What are Speculative Retries?

C* Setting & Internal Code

C* Testing and Outcome

Datastax Java Driver

Speculative Retries

• “The Tail at Scale” - Google paper

• Hedged Requests vs. Speculative Retries

Speculative Retries

• Tuning available both in C* and Java Driver

• It is all about the additional read retries to improve the latency tail
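
A rough way to see the effect on the tail is a small simulation. The sketch below is illustrative only: the latency distribution, the 5% slow-request rate, and the 10 ms retry delay are made-up assumptions, not figures from the talk.

```python
import math
import random

def sample_latency(rng):
    """One replica read: usually about 1 ms, sometimes a 50 ms outlier (assumed)."""
    return 50.0 if rng.random() < 0.05 else rng.uniform(0.5, 1.5)

def read_latency(rng, retry_after_ms=None):
    """Latency of one read; optionally send a second request after a delay."""
    first = sample_latency(rng)
    if retry_after_ms is None or first <= retry_after_ms:
        return first
    # The retry starts retry_after_ms late; whichever response arrives first wins.
    second = retry_after_ms + sample_latency(rng)
    return min(first, second)

def percentile(values, pct):
    """Nearest-rank percentile of a list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]

rng = random.Random(42)
p99_plain = percentile([read_latency(rng) for _ in range(10_000)], 99)

rng = random.Random(42)
p99_retry = percentile(
    [read_latency(rng, retry_after_ms=10.0) for _ in range(10_000)], 99)

print(f"p99 without retry: {p99_plain:.1f} ms, with retry: {p99_retry:.1f} ms")
```

Under these assumptions the retry only helps the slowest requests; the median is untouched, and each retry is extra load on another replica.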

C* Setting & Internals

• Read path only

• Set at table level:
  – None, Always, Custom, Percentile

• Using different executors for different behaviors

• Impacted by Read Repair (for simplicity, ignore in this talk)
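
For reference, the table-level setting is the `speculative_retry` table option in CQL. The helper below is a small sketch that builds the `ALTER TABLE` statement for each of the four modes; the function name and the keyspace/table names used with it are hypothetical.

```python
def speculative_retry_cql(keyspace, table, policy, value=None):
    """Build an ALTER TABLE statement for the speculative_retry table option."""
    policy = policy.lower()
    if policy in ("custom", "percentile") and value is None:
        raise ValueError(f"policy '{policy}' needs a value")
    if policy == "none":
        setting = "NONE"
    elif policy == "always":
        setting = "ALWAYS"
    elif policy == "custom":
        setting = f"{value}ms"          # fixed delay in milliseconds
    elif policy == "percentile":
        setting = f"{value}percentile"  # percentile of observed read latency
    else:
        raise ValueError(f"unknown policy: {policy}")
    return f"ALTER TABLE {keyspace}.{table} WITH speculative_retry = '{setting}';"

# Hypothetical keyspace and table names.
print(speculative_retry_cql("ks", "users", "percentile", 99))
print(speculative_retry_cql("ks", "users", "custom", 10))
```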

No Retry Policy

• Use NeverSpeculatingReadExecutor

• Coordinator computes a list of alive replicas:
  – #blockFor

• Sends only one full data request to one replica

• Sends additional digest request(s) to others depending on Consistency Level (CL)
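
The steps above can be sketched as a small planning function. This is a simplified model, not Cassandra's code: it assumes a single datacenter (so LOCAL_QUORUM is computed from the full replication factor), and the function names are hypothetical.

```python
def block_for(consistency_level, replication_factor):
    """Number of replica responses the coordinator waits for (simplified)."""
    quorum = replication_factor // 2 + 1
    return {
        "ONE": 1, "TWO": 2, "THREE": 3,
        "QUORUM": quorum, "LOCAL_QUORUM": quorum,
        "ALL": replication_factor,
    }[consistency_level]

def plan_without_speculation(live_replicas, consistency_level, replication_factor):
    """Return (data_targets, digest_targets) for a read with no speculative retry."""
    needed = block_for(consistency_level, replication_factor)
    if len(live_replicas) < needed:
        raise RuntimeError("not enough live replicas for this consistency level")
    targets = live_replicas[:needed]
    # One full data request; every other contacted replica only returns a digest.
    return targets[:1], targets[1:]

data, digests = plan_without_speculation(["r1", "r2", "r3"], "LOCAL_QUORUM", 3)
print(data, digests)
```

With replication factor 3 and LOCAL_QUORUM, only two of the three replicas are contacted, so a single slow replica among them delays the whole read.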

No Retry Policy

Example using CL_LOCAL_QUORUM (replication factor 3)

Always Retry Policy

• Use AlwaysSpeculatingReadExecutor

• List of replicas: (#blockFor + 1) or all nodes

• Coordinator sends 2 full data requests to 2 replicas

• Sends additional digest requests to others depending on Consistency Level
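
As a sketch of the same planning step with the extra request sent up front (a simplified model with a hypothetical function name; `block_for` is passed in as a plain number):

```python
def plan_always_speculating(live_replicas, block_for):
    """Return (data_targets, digest_targets) when the retry is always sent up front."""
    if len(live_replicas) < block_for:
        raise RuntimeError("not enough live replicas for this consistency level")
    # One replica more than blockFor, capped by how many replicas are alive.
    targets = live_replicas[:block_for + 1]
    # Two full data requests; any remaining targets get digest requests.
    return targets[:2], targets[2:]

# CL ONE (blockFor = 1): two data requests and no digests.
print(plan_always_speculating(["r1", "r2", "r3"], 1))
# LOCAL_QUORUM with replication factor 3 (blockFor = 2): two data requests, one digest.
print(plan_always_speculating(["r1", "r2", "r3"], 2))
```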

Always Retry Policy

Example using CL_ONE

Custom/Percentile Retry Policy

• Use SpeculatingReadExecutor

• List of replicas: #blockFor + 1

• Send 1 full data request and 1 or more digest requests to replicas, depending on the Consistency Level

• Use the last replicas on the list for the retries

• Coordinator waits for a duration before retrying on another replica
  – Custom: duration in milliseconds
  – Percentile: the percentile of the sampled latency distribution
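
The waiting rule can be sketched as follows. This is an illustration under assumptions, not Cassandra's implementation: the nearest-rank percentile and both function names are choices made for the example.

```python
import math

def retry_threshold_ms(setting, recent_latencies_ms):
    """Delay before speculating, from a setting such as '10ms' or '95percentile'."""
    setting = setting.lower()
    if setting.endswith("percentile"):
        if not recent_latencies_ms:
            raise ValueError("percentile mode needs latency samples")
        pct = float(setting[:-len("percentile")])
        ordered = sorted(recent_latencies_ms)
        rank = max(1, math.ceil(len(ordered) * pct / 100))  # nearest rank
        return ordered[rank - 1]
    if setting.endswith("ms"):
        return float(setting[:-2])
    raise ValueError(f"unsupported setting: {setting}")

def should_speculate(elapsed_ms, threshold_ms, responses_received, block_for):
    """Retry on the spare replica once the wait has reached the threshold."""
    return responses_received < block_for and elapsed_ms >= threshold_ms

samples = list(range(1, 101))  # made-up latency samples: 1 ms .. 100 ms
threshold = retry_threshold_ms("95percentile", samples)
print(threshold, should_speculate(120, threshold, 1, 2))
```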

Custom/Percentile Retry Policy

Example using CL_LOCAL_QUORUM (replication factor 3)

C* test - setup

• Cassandra: 12 nodes (8 cores/60 GB RAM), approx. 120 GB data per node

• Clients: 6 (8 cores/30 GB RAM)

• Use Linux Traffic Control (tc) to simulate a network glitch on one C* node by delaying packet transmission
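
The talk does not give the exact tc invocation. One common way to delay packets is the netem qdisc; the sketch below only builds the command lines (it does not run them), and the interface name and delay values are assumptions.

```python
def netem_delay_commands(interface, delay_ms, jitter_ms=0):
    """Return (add, remove) argument lists for a tc netem delay on one interface."""
    add = ["tc", "qdisc", "add", "dev", interface, "root", "netem",
           "delay", f"{delay_ms}ms"]
    if jitter_ms:
        add.append(f"{jitter_ms}ms")  # optional random variation around the delay
    remove = ["tc", "qdisc", "del", "dev", interface, "root", "netem"]
    return add, remove

# Assumed values: interface eth0, 100 ms delay with 20 ms jitter.
add_cmd, remove_cmd = netem_delay_commands("eth0", 100, jitter_ms=20)
print(" ".join(add_cmd))
print(" ".join(remove_cmd))
```

Both commands need root on the target node, for example via `subprocess.run(add_cmd, check=True)`.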

C* test - result

• Clients send 50K reads/sec with 3K writes/sec

• Use tc to force a slow node

• Enable the 95th percentile Speculative Retry setting

• Throughput degraded to 30K reads/sec on the coordinator

• Avg. latency doubled from 0.5 ms to 1 ms

• 95th/99th percentile latencies have many spikes of 2x-10x

C* test - result

• Clients send 18K reads/sec with 1K writes/sec

• Use tc to force a slow node

• Keep the 95th percentile Speculative Retry setting on

• Throughput increased to 20+K reads/sec

• Avg. latency doubled from 0.5 ms to 1 ms

• 95th/99th percentile latencies are lower and more stable, with no spikes

C* test - conclusion

• For more stable and lower 95th/99th percentile latencies:
  – Build the cluster with extra capacity and turn on Speculative Retry

• If the C* cluster is already near its max capacity:
  – Disable Speculative Retry, as it makes performance worse

Client Speculative Retries

• SpeculativeExecutionPolicy

• Speculative retries for both reads and writes

• But only for idempotent requests
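
A toy example of why idempotence matters here: with a speculative retry, the same request may end up being applied twice. The two operations below are made up for illustration.

```python
def apply_all(state, operations):
    """Apply each operation to the state in order and return the result."""
    for operation in operations:
        state = operation(state)
    return state

def set_to_five(value):
    """Idempotent: applying it twice gives the same result as once."""
    return 5

def increment(value):
    """Not idempotent: every extra execution changes the result."""
    return value + 1

# The original request and its speculative copy both reach the cluster.
once_set, twice_set = apply_all(0, [set_to_five]), apply_all(0, [set_to_five] * 2)
once_inc, twice_inc = apply_all(0, [increment]), apply_all(0, [increment] * 2)
print(once_set == twice_set, once_inc == twice_inc)
```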

Client Speculative Retries

Example using CL_LOCAL_QUORUM (replication factor 3)

NoSpeculativeExecutionPolicy

• Just the normal Driver behavior

• Retries are attempted only after a fixed timeout as part of Failover

ConstantSpeculativeExecutionPolicy

Trigger a Speculative Retry if the Coordinator hasn’t answered within a given delay.
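
The behavior can be sketched with asyncio. This is not the driver's code: `hedged_call`, `fake_send`, and the host names are hypothetical, and error handling and failover are left out.

```python
import asyncio

async def hedged_call(send, hosts, delay_s, max_extra=1):
    """Send to hosts[0]; after each delay_s without an answer, also try the next host."""
    if not hosts:
        raise ValueError("at least one host is required")
    pending = set()
    try:
        for host in hosts[:max_extra + 1]:
            pending.add(asyncio.create_task(send(host)))
            done, pending = await asyncio.wait(
                pending, timeout=delay_s, return_when=asyncio.FIRST_COMPLETED)
            if done:
                return done.pop().result()
        # Every allowed execution is in flight; take whichever answers first.
        done, pending = await asyncio.wait(
            pending, return_when=asyncio.FIRST_COMPLETED)
        return done.pop().result()
    finally:
        for task in pending:
            task.cancel()  # drop the executions that lost the race

async def fake_send(host):
    # Hypothetical coordinators: "slow" answers in 0.5 s, others in 0.01 s.
    await asyncio.sleep(0.5 if host == "slow" else 0.01)
    return host

result = asyncio.run(hedged_call(fake_send, ["slow", "fast"], delay_s=0.05))
print(result)
```

Here the first coordinator misses the 50 ms delay, so a second execution goes to the next host and its answer is the one returned.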

PercentileSpeculativeExecutionPolicy

Trigger speculative retries if the chosen coordinator doesn’t answer within a given percentile of its typical response time
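
One way to picture this is a per-host latency tracker that turns a percentile into a delay. The class below is a sketch under assumptions: the sliding window, the nearest-rank percentile, and the fallback delay are choices made for the example, not the driver's implementation.

```python
import math
from collections import defaultdict, deque

class LatencyTracker:
    """Keeps recent latencies per host and reports a percentile of them."""

    def __init__(self, window=1000):
        # Only the most recent `window` samples per host are kept.
        self._samples = defaultdict(lambda: deque(maxlen=window))

    def record(self, host, latency_ms):
        self._samples[host].append(latency_ms)

    def delay_ms(self, host, pct, default_ms):
        """Speculation delay for a host; default_ms until samples exist."""
        samples = sorted(self._samples.get(host, ()))
        if not samples:
            return default_ms
        rank = max(1, math.ceil(len(samples) * pct / 100))  # nearest rank
        return samples[rank - 1]

tracker = LatencyTracker()
for latency in range(1, 101):  # made-up samples: 1 ms .. 100 ms
    tracker.record("host-a", latency)

print(tracker.delay_ms("host-a", 99, default_ms=50))
print(tracker.delay_ms("host-b", 99, default_ms=50))
```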

Questions?