designing distributed systems using approximate synchrony...

60
Designing Distributed Systems using Approximate Synchrony in Data Center Networks Dan R. K. Ports Jialin Li Vincent Liu Naveen Kr. Sharma Arvind Krishnamurthy University of Washington CSE

Upload: others

Post on 25-Sep-2020

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Designing Distributed Systems using Approximate Synchrony

in Data Center NetworksDan R. K. Ports

Jialin Li Vincent Liu Naveen Kr. Sharma Arvind Krishnamurthy

University of Washington CSE

Page 2: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Today’s most popular applications are distributed systems in the data center

Page 3: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Today’s most popular applications are distributed systems in the data center

Modern data center: ~50,000 commodity servers

constant server failures

Page 4: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

How do we program the data center? Use distributed algorithms to tolerate failures, inconsistencies

Example: Paxos state machine replication

Page 5: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Distributed systems and networks are typically designed independently

??

?

Page 6: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Distributed systems and networks are typically designed independently

asynchronous network (Internet)packets may be arbitrarily

• dropped

• delayed

• reordered

??

?

Page 7: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Distributed systems and networks are typically designed independently

Page 8: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Distributed systems and networks are typically designed independently

Data center networks are different!

Page 9: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Data center networks are more predictable • known topology, routes, predictable latencies

Data center networks are more reliable

Data center networks are extensible • single administrative domain makes changes possible

• software-defined networking exposes sophisticated line-rate processing capability

Data Center Networks Are Different

Page 10: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Data center networks are more predictable • known topology, routes, predictable latencies

Data center networks are more reliable

Data center networks are extensible • single administrative domain makes changes possible

• software-defined networking exposes sophisticated line-rate processing capability

We should co-design distributed systems and data center networks!

Data Center Networks Are Different

Page 11: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Co-Designing Networks and Distributed Systems

Design the data center network to support distributed applications

Design distributed applications around the properties of thedata center network

Page 12: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

This TalkA concrete instantiation:improving replication performance using Speculative Paxos and Mostly-Ordered Multicast

Page 13: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

This TalkA concrete instantiation:improving replication performance using Speculative Paxos and Mostly-Ordered Multicast

new replicationprotocol

new network primitive

Page 14: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

This TalkA concrete instantiation:improving replication performance using Speculative Paxos and Mostly-Ordered Multicast

3x throughput and 40% lower latency than conventional approach

new replicationprotocol

new network primitive

Page 15: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Outline1. Co-designing Distributed Systems and

Data Center Networks

2. Background: State Machine Replication & Paxos

3. Mostly-Ordered Multicast and Speculative Paxos

4. Evaluation

Page 16: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

State Machine ReplicationUsed to tolerate failures in datacenter applications

• keep critical management services online(e.g., Google’s Chubby, Zookeeper)

• persistent storage in distributed databases (e.g., Spanner, H-Store)

Strongly consistent (linearizable) replication, i.e.,

all replicas execute same operations in same order …even when up to half replicas fail…even when messages are lost

Page 17: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Replica

Example: Paxos

Client

Leader Replica

Replica

Page 18: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Replica

Example: Paxos

Client

Leader Replica

Replica

request

Page 19: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Replica

Example: Paxos

Client

Leader Replica

Replica

request prepare

Page 20: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Replica

Example: Paxos

Client

Leader Replica

Replica

request prepare prepareok

Page 21: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Replica

Example: Paxos

Client

Leader Replica

Replica

request prepare prepareok

exec

Page 22: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Replica

Example: Paxos

Client

Leader Replica

Replica

request prepare prepareok reply

commitexec

Page 23: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Replica

Example: Paxos

latency: 4 message delays

Client

Leader Replica

Replica

request prepare prepareok reply

commitexec

Page 24: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Replica

Example: Paxos

latency: 4 message delays

Client

Leader Replica

Replica

request prepare prepareok reply

commitexec

throughput: bottleneck replica processes 2n msgs

Page 25: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Outline1. Co-designing Distributed Systems and

Data Center Networks

2. Background: State Machine Replication & Paxos

3. Mostly-Ordered Multicast andSpeculative Paxos

4. Evaluation

Page 26: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Improving Paxos PerformancePaxos requires a leader replica to order requests

Can we use the network instead?

Page 27: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Improving Paxos PerformancePaxos requires a leader replica to order requests

Can we use the network instead?

Engineer the network to provideMostly-Ordered Multicast (MOM) - best-effort ordering of multicasts

New replication protocol: Speculative Paxos - commits most operations in a single round trip

Page 28: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Mostly-Ordered MulticastConcurrent messages are ordered: If any node receives message A then B, then all other receivers process them in the same order

• best effort — not guaranteed

Practical to implement • can be violated in event of network failure • but not satisfied by existing multicast protocols!

Page 29: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Mostly-Ordered Multicast

Page 30: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Mostly-Ordered Multicast• Different path

lengths, congestion cause reordering

Page 31: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Mostly-Ordered Multicast• Different path

lengths, congestion cause reordering

Page 32: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Mostly-Ordered Multicast• Different path

lengths, congestion cause reordering

Page 33: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Mostly-Ordered Multicast• Different path

lengths, congestion cause reordering

Page 34: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Mostly-Ordered Multicast• Different path

lengths, congestion cause reordering

• MOM approach:Route multicast messages to a root switch equidistant from receivers

Page 35: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Mostly-Ordered Multicast• Different path

lengths, congestion cause reordering

• MOM approach:Route multicast messages to a root switch equidistant from receivers

Page 36: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

MOM Design Options

better ordering

less network support

Page 37: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

MOM Design Options

1. Topology-Aware Multicastroute packets to a randomly-chosen root switch

better ordering

less network support

Page 38: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

MOM Design Options

1. Topology-Aware Multicastroute packets to a randomly-chosen root switch

2. High-Priority Multicastuse higher QoS priority to avoid link congestion

better ordering

less network support

Page 39: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

MOM Design Options

1. Topology-Aware Multicastroute packets to a randomly-chosen root switch

2. High-Priority Multicastuse higher QoS priority to avoid link congestion

3. Network Serializationroute packets through a single root switch

better ordering

less network support

Page 40: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Speculative PaxosNew state machine replication protocol Relies on MOM to order requests in the normal case But not required:

• remains correct even with reorderings:safety + liveness under usual conditions

Page 41: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Replica

Speculative Paxos

Client

Replica

Replica

Page 42: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Replica

Speculative Paxos

Client

Replica

Replica

request

Page 43: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Replica

Speculative Paxos

Client

Replica

Replica

request spec-reply(result, hash)

spec-exec

spec-exec

spec-exec

replicas immediately speculatively execute request & reply!

Page 44: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Replica

Speculative Paxos

Client

Replica

Replica

request spec-reply(result, hash)

spec-exec

spec-exec

spec-exec

match?

replicas immediately speculatively execute request & reply!client checks for matching responses from 3/4 superquorum

Page 45: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Replica

Speculative Paxos

Client

Replica

Replica

request spec-reply(result, hash)

spec-exec

spec-exec

spec-exec

match?

latency: 2 message delays (vs 4)

replicas immediately speculatively execute request & reply!client checks for matching responses from 3/4 superquorum

Page 46: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Replica

Speculative Paxos

Client

Replica

Replica

request spec-reply(result, hash)

spec-exec

spec-exec

spec-exec

match?

latency: 2 message delays (vs 4)

replicas immediately speculatively execute request & reply!

no bottleneck replica each processes only 2 msgs

client checks for matching responses from 3/4 superquorum

Page 47: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Speculative ExecutionReplicas execute requests speculatively • might have to roll back operations

Clients know their requests succeeded • they check for matching hashes in replies

• means clients don’t need to speculate

Similar to Zyzzyva [SOSP’07]

Page 48: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Handling Ordering ViolationsWhat if replicas don’t execute requests in the same order?

Replicas periodically run synchronization protocol

If divergence detected: reconciliation • replicas pause execution, select leader, send logs

• leader decides ordering for operations and notifies replicas

• replicas rollback and re-execute requests in proper order

Page 49: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Handling Ordering ViolationsWhat if replicas don’t execute requests in the same order?

Replicas periodically run synchronization protocol

If divergence detected: reconciliation • replicas pause execution, select leader, send logs

• leader decides ordering for operations and notifies replicas

• replicas rollback and re-execute requests in proper orderNote: 3/4 superquorum requirement ensures new leader can always be sure which requests succeeded even if 1/2 fail. [cf. Fast Paxos]

Page 50: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Outline

1. Co-designing Distributed Systems and Data Center Networks

2. Background: State Machine Replication & Paxos

3. Mostly-Ordered Multicast and Speculative Paxos

4. Evaluation

Page 51: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Evaluation Setup12-switch fat tree testbed1 Gb / 10 Gb ethernet3 replicas (2.27 GHz Xeon L5640)

MOM scalability experiments:2560-host simulated fat tree data center network background traffic from Microsoft data center measurements

Page 52: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

SpecPaxos Improves Latency and Throughput

latency (us)

throughput (ops / second)

(emulated datacenter network with MOMs)

better ↑

better →

Page 53: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

-1200

-900

-600

-300

0

0 25,000 50,000 75,000 100,000

SpecPaxos Improves Latency and Throughput

latency (us)

throughput (ops / second)

Paxos

SpecPaxos

(emulated datacenter network with MOMs)

better ↑

better →

Page 54: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

-1200

-900

-600

-300

0

0 25,000 50,000 75,000 100,000

SpecPaxos Improves Latency and Throughput

latency (us)

throughput (ops / second)

Paxos

SpecPaxos

3x throughput and 40% lower latency than Paxos

(emulated datacenter network with MOMs)

better ↑

better →

Page 55: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

-1200

-900

-600

-300

0

0 25,000 50,000 75,000 100,000

SpecPaxos Improves Latency and Throughput

latency (us)

throughput (ops / second)

PaxosFast Paxos

SpecPaxos

Paxos + batching

(emulated datacenter network with MOMs)

better ↑

better →

Page 56: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

-1200

-900

-600

-300

0

0 25,000 50,000 75,000 100,000

SpecPaxos Improves Latency and Throughput

latency (us)

throughput (ops / second)

PaxosFast Paxos

SpecPaxos

Paxos + batching

(emulated datacenter network with MOMs)

better ↑

better →

better latency than Fast Paxos and same throughput as batching!

Page 57: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

MOMs Provide Necessary SupportTh

roug

hput

0

30000

60000

90000

120000

Simulated packet reordering rate0.001% 0.01% 0.1% 1%

Speculative Paxos Paxos

Page 58: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

MOM Ordering Effectiveness

Testbed(12 switches)

Simulation (119 switches,

2560 hosts)

Regular Multicast 1-10% 1-2%

Topology-Aware MOM 0.001%-0.05% 0.01%-0.1%

Network Serialization ~0% ~0%

Ordering Violation Rates

Page 59: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

Paxos

Paxos+Batching

Fast Paxos

SpecPaxos

0 1500 3000 4500 6000

Max Throughput (Transactions/second)

Application PerformanceTransactional key-value store (2PC + OCC)Synthetic workload based on Retwis Twitter clone

< 250 LOC required to implement rollback

Measured transactions/sec that meet 10 ms SLO

Page 60: Designing Distributed Systems using Approximate Synchrony ...courses.cs.washington.edu/courses/csep552/16wi/slides/lec8-adriana.pdfModern data center: ~50,000 commodity servers constant

SummaryNew approach to building distributed systemsbased on co-designing with the data center network

Dramatic performance improvement for replication by combining

• MOM network primitive for best-effort ordering

• Speculative Paxos: efficient replication protocol

This is only the first step for co-designing distributed systems and data center networks!