Reliable Distributed Systems Scalability

Upload: brosh

Post on 16-Jan-2016

TRANSCRIPT

  • Reliable Distributed Systems: Scalability

  • Scalability
    - Today we'll focus on how things scale
    - Basically: look at a property that matters
    - Make something "bigger"
    - Like the network, the number of groups, the number of members, the data rate
    - Then measure the property and see impact
    - Often we can hope that no slowdown would occur. But what really happens?

  • Stock Exchange Problem: Sometimes, someone is slow
    - Most members are healthy...
    - ...but one is slow, i.e. something is contending with the red process, delaying its handling of incoming messages

  • With a slow receiver, throughput collapses as the system scales up
    [Figure: Virtually synchronous Ensemble multicast protocols. x-axis: perturb rate (0 to 0.9); y-axis: average throughput on nonperturbed members (0 to 250); series: group size 32, 64, 96]

  • Why does this happen?
    - Superficially, because data for the slow process piles up in the sender's buffer, causing flow control to kick in (prematurely)
    - But why does the problem grow worse as a function of group size, with just one red process?
    - Small perturbations happen all the time

  • Broad picture?
    - Virtual synchrony works well under bursty loads
    - And it scales to fairly large systems (SWX uses a hierarchy to reach ~500 users)
    - From what we've seen so far, this is about as good as it gets for reliability
    - Recall that stronger reliability models like Paxos are costly and scale far worse
    - Desired: steady throughput under heavy load and stress

  • Protocols famous for scalability
    - Scalable Reliable Multicast (SRM)
    - Reliable Multicast Transport Protocol (RMTP)
    - On-Tree Efficient Recovery using Subcasting (OTERS)
    - Several others: TMP, MFTP, MFTP/EC...
    - But when stability is tested under stress, every one of these protocols collapses just like virtual synchrony!

  • Example: Scalable Reliable Multicast (SRM)
    - Originated in work on Wb and Mbone
    - Idea is to do local repair if messages are lost; various optimizations keep load low and repair costs localized
    - Wildly popular for Internet push, seen as solution for Internet radio and TV
    - But receiver-driven reliability model lacks strong reliability guarantees

  • Local Repair Concept
    [Figure, animated over several slides: a packet is lost (X); receipt of a subsequent packet triggers a NACK for the missing packet; the packet is retransmitted; other processes receive a useless NAK and a duplicate repair]

  • Limitations?
    - SRM runs in the application, not the router, hence IP multicast of NACKs and retransmissions tends to reach many or all processes
    - Lacking knowledge of who should receive each message, SRM has no simple way to know when a message can be garbage collected at the application layer
    - Probabilistic rules to suppress duplicates

  • In practice?
    - As the system grows large the probabilistic suppression fails
    - More and more NAKs are sent in duplicate
    - And more and more duplicate data messages are sent as multiple receivers respond to the same NAK
    - Why does this happen?

  • Visualizing how SRM collapses
    - Think of sender as the hub of a wheel
    - Messages depart in all directions
    - Loss can occur at many places out there and they could be far apart
    - Hence NAK suppression won't work
    - Causing multiple NAKs
    - And the same reasoning explains why any one NAK is likely to trigger multiple retransmissions!
    - Experiments have confirmed that SRM overheads soar with deployment size
    - Every message triggers many NAKs and many retransmissions until the network finally melts down
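
The scaling argument can be illustrated with a toy model of timer-based NAK suppression. This is a sketch under loud assumptions, not SRM's actual timer rules: every receiver misses the same packet, each draws a uniform random backoff timer, and each hears the first NAK a fixed `propagation_delay` after it is sent. The names `naks_sent` and `mean_naks` and all parameter values are hypothetical.

```python
import random

def naks_sent(n_receivers, timer_max, propagation_delay, rng):
    """Count NAKs multicast for one lost packet in a toy suppression model.

    Assumptions (not taken from SRM itself): all receivers missed the packet,
    timers are uniform on [0, timer_max], and the first NAK reaches everyone
    exactly propagation_delay after it is sent.
    """
    timers = [rng.uniform(0, timer_max) for _ in range(n_receivers)]
    first = min(timers)
    # Any receiver whose timer fires before the first NAK arrives
    # cannot be suppressed and sends a duplicate NAK.
    return sum(1 for t in timers if t < first + propagation_delay)

def mean_naks(n_receivers, trials=200, seed=0):
    """Average NAK count over several independent losses."""
    rng = random.Random(seed)
    total = sum(naks_sent(n_receivers, 1.0, 0.05, rng) for _ in range(trials))
    return total / trials

if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, mean_naks(n))
```

In this model the number of unsuppressed NAKs grows roughly in proportion to the number of receivers, because the window before the first NAK is heard stays the same while more timers fall inside it.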

  • Dilemma confronting developers
    - Application is extremely critical: stock market, air traffic control, medical system
    - Hence need a strong model, guarantees
    - But these applications often have a soft-realtime subsystem
    - Steady data generation
    - May need to deliver over a large scale

  • Today: introduce a new design point
    - Bimodal multicast (pbcast) is reliable in a sense that can be formalized, at least for some networks
    - Generalization for larger class of networks should be possible but maybe not easy
    - Protocol is also very stable under steady load even if 25% of processes are perturbed
    - Scalable in much the same way as SRM

  • Environment
    - Will assume that most links have known throughput and loss properties
    - Also assume that most processes are responsive to messages in bounded time
    - But can tolerate some flaky links and some crashed or slow processes

  • Start by using unreliable multicast to rapidly distribute the message. But some messages may not get through, and some processes may be faulty. So initial state involves partial distribution of multicast(s)

  • Periodically (e.g. every 100ms) each process sends a digest describing its state to some randomly selected group member. The digest identifies messages. It doesn't include them.

  • Recipient checks the gossip digest against its own history and solicits a copy of any missing message from the process that sent the gossip

  • Processes respond to solicitations received during a round of gossip by retransmitting the requested message. The round lasts much longer than a typical RPC time.
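
The digest, solicit, and retransmit steps can be sketched as a single in-memory gossip round. This is a minimal illustration, not the pbcast implementation: the `Process` class, the `(sender, seqno)` message ids, and the lossless, instantaneous exchange are all assumptions made for the example.

```python
import random

class Process:
    """Hypothetical group member holding messages keyed by (sender, seqno)."""
    def __init__(self, name):
        self.name = name
        self.history = {}  # message id -> payload

    def digest(self):
        # The digest identifies messages; it does not include them.
        return set(self.history)

def gossip_round(processes, rng):
    """Run one round: each process gossips its digest to one random peer,
    and the peer solicits (and receives) whatever it is missing."""
    # Snapshot state first so a message travels at most one hop per round.
    snapshots = [dict(p.history) for p in processes]
    for sender, snap in zip(processes, snapshots):
        peers = [p for p in processes if p is not sender]
        target = rng.choice(peers)
        missing = set(snap) - target.digest()
        for msg_id in missing:
            # Solicitation answered by a retransmission from the gossiper.
            target.history[msg_id] = snap[msg_id]

if __name__ == "__main__":
    rng = random.Random(1)
    group = [Process(f"p{i}") for i in range(8)]
    group[0].history[("p0", 1)] = b"tick"  # partial initial distribution
    for r in range(1, 11):
        gossip_round(group, rng)
        print(r, sum(("p0", 1) in p.history for p in group))
```

Starting from a single holder, the printed count of holders should rise quickly toward the full group, which is the behavior the later analysis slides quantify.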

  • Delivery? Garbage Collection?
    - Deliver a message when it is in FIFO order
    - Garbage collect a message when you believe that no healthy process could still need a copy (we used to wait 10 rounds, but now are using gossip to detect this condition)
    - Match parameters to intended environment
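
A per-sender buffer that delivers in FIFO order and discards stored copies after a fixed number of rounds might look like the sketch below. It follows the older "wait 10 rounds" rule mentioned on the slide, not the gossip-based detection; the `FifoBuffer` class is hypothetical, and giving up on a gap that is never filled is not modeled.

```python
class FifoBuffer:
    """Receive buffer for one sender: FIFO delivery, age-based garbage collection."""
    def __init__(self, gc_rounds=10):
        self.next_seq = 1   # next sequence number eligible for delivery
        self.pending = {}   # seqno -> payload, received but blocked behind a gap
        self.stored = {}    # seqno -> (payload, round received), kept for retransmission
        self.gc_rounds = gc_rounds

    def receive(self, seqno, payload, current_round):
        """Store a message and return the payloads that become deliverable."""
        if seqno < self.next_seq or seqno in self.pending:
            return []       # duplicate copy, nothing new to deliver
        self.pending[seqno] = payload
        self.stored[seqno] = (payload, current_round)
        delivered = []
        while self.next_seq in self.pending:
            delivered.append(self.pending.pop(self.next_seq))
            self.next_seq += 1
        return delivered

    def collect_garbage(self, current_round):
        """Drop copies held for gc_rounds or more; return the discarded seqnos."""
        old = [s for s, (_, r) in self.stored.items()
               if current_round - r >= self.gc_rounds]
        for s in old:
            del self.stored[s]
        return old
```

For example, receiving sequence number 2 before 1 delivers nothing until 1 arrives, at which point both are delivered in order.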

  • Need to bound costs
    - Worries:
    - Someone could fall behind and never catch up, endlessly loading everyone else
    - What if some process has lots of stuff others want and they bombard him with requests?
    - What about scalability in buffering and in list of members of the system, or costs of updating that list?

  • Optimizations
    - Request retransmissions most recent multicast first
    - Idea is to catch up quickly leaving at most one gap in the retrieved sequence
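
The most-recent-first rule can be sketched as a small selection function. The function name `choose_solicitations` and the per-round `budget` parameter are assumptions for illustration; the slide only states the ordering.

```python
def choose_solicitations(have, advertised, budget):
    """Pick message sequence numbers to solicit this round.

    have: sequence numbers already held
    advertised: sequence numbers listed in the received digest
    budget: hypothetical cap on solicitations per round
    Returns the missing numbers, newest first, truncated to the budget.
    """
    missing = sorted(set(advertised) - set(have), reverse=True)
    return missing[:budget]

if __name__ == "__main__":
    have = {1, 2, 3}
    wanted = choose_solicitations(have, range(1, 11), budget=4)
    print(wanted)                # newest four of the missing messages
    print(sorted(have | set(wanted)))
```

A process that holds 1 to 3 and is offered 1 to 10 solicits 10, 9, 8, 7 first, so afterward it holds 1 to 3 and 7 to 10 with a single gap (4 to 6) left to fill.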

  • Optimizations
    - Participants bound the amount of data they will retransmit during any given round of gossip. If too much is solicited they ignore the excess requests

  • Optimizations
    - Label each gossip message with sender's gossip round number
    - Ignore solicitations that have expired round number, reasoning that they arrived very late hence are probably no longer correct

  • Optimizations
    - Don't retransmit same message twice in a row to any given destination (the copy may still be in transit hence request may be redundant)
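
The three responder-side rules (a per-round retransmission bound, round-number expiry, and no back-to-back repeats to one destination) can be combined in one sketch. The `Retransmitter` class, the byte-based budget, and the choice to allow a repeat after one refusal are interpretations made for this example, not details given in the slides.

```python
class Retransmitter:
    """Decides whether to answer a solicitation during the current gossip round."""
    def __init__(self, store, byte_budget):
        self.store = store              # message id -> payload (bytes)
        self.byte_budget = byte_budget  # assumed cap on bytes retransmitted per round
        self.round = 0
        self.sent_this_round = 0
        self.last_sent = {}             # destination -> id most recently sent there

    def start_round(self):
        self.round += 1
        self.sent_this_round = 0

    def handle(self, dest, msg_id, gossip_round):
        """Return the payload to retransmit, or None if the request is ignored."""
        if gossip_round != self.round:
            return None                 # solicitation carries an expired round number
        if self.last_sent.get(dest) == msg_id:
            # Previous copy may still be in transit; refuse once, then allow again.
            del self.last_sent[dest]
            return None
        payload = self.store.get(msg_id)
        if payload is None:
            return None                 # already garbage collected
        if self.sent_this_round + len(payload) > self.byte_budget:
            return None                 # over budget: ignore the excess request
        self.sent_this_round += len(payload)
        self.last_sent[dest] = msg_id
        return payload
```

Ignored requests cost the responder nothing beyond the check, which is what keeps a lagging process from loading everyone else.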

  • Optimizations
    - Use IP multicast when retransmitting a message if several processes lack a copy
    - For example, if solicited twice
    - Also, if a retransmission is received from far away
    - Tradeoff: excess messages versus low latency
    - Use regional TTL to restrict multicast scope

  • Scalability
    - Protocol is scalable except for its use of the membership of the full process group
    - Updates could be costly
    - Size of list could be costly
    - In large groups, would also prefer not to gossip over long high-latency links

  • Can extend pbcast to solve both
    - Could use IP multicast to send initial message. (Right now, we have a tree-structured alternative, but to use it, need to know the membership)
    - Tell each process only about some subset k of the processes, k

  • Router overload problem
    - Random gossip can overload a central router
    - Yet information flowing through this router is of diminishing quality as rate of gossip rises
    - Insight: constant rate of gossip is achievable and adequate

  • [Figure: load on WAN link (msgs/sec) and latency to delivery (ms)]

  • Hierarchical Gossip
    - Weight gossip so that probability of gossip to a remote cluster is smaller
    - Can adjust weight to have constant load on router
    - Now propagation delays rise, but just increase rate of gossip to compensate
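
Weighted target selection can be sketched as follows. The cluster layout, the function names, and the rule "remote weight = target router load divided by cluster size" are assumptions chosen to illustrate the constant-load idea; the slides do not give the actual weighting formula.

```python
import random

def weight_for_constant_load(cluster_size, target_msgs_per_round):
    """Remote-gossip probability that keeps the expected number of
    cross-router gossip messages per round at the target, whatever the cluster size."""
    return min(1.0, target_msgs_per_round / cluster_size)

def pick_gossip_target(me, clusters, remote_weight, rng):
    """Choose a gossip peer, preferring the local cluster.

    me: (cluster name, member name)
    clusters: dict mapping cluster name -> list of member names
    Assumes at least one other member exists somewhere in the system.
    """
    my_cluster, my_name = me
    local = [m for m in clusters[my_cluster] if m != my_name]
    remote = [m for c, members in clusters.items()
              if c != my_cluster for m in members]
    if remote and (not local or rng.random() < remote_weight):
        return rng.choice(remote)   # this gossip crosses the router
    return rng.choice(local)
```

With 50 members per cluster and a target of 5 remote gossips per round, each member would gossip remotely with probability 0.1; doubling the cluster size halves the per-member probability so the router load stays flat.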

  • Remainder of talk
    - Show results of formal analysis
    - We developed a model (won't do the math here -- nothing very fancy)
    - Used model to solve for expected reliability
    - Then show more experimental data
    - Real question: what would pbcast do in the Internet? Our experience: it works!

  • Idea behind analysis
    - Can use the mathematics of epidemic theory to predict reliability of the protocol
    - Assume an initial state
    - Now look at result of running B rounds of gossip: converges exponentially quickly towards atomic delivery
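
The flavor of the epidemic argument can be seen in a mean-field recurrence for the expected number of processes holding a message after each round. This is a simplified stand-in, not the model developed in the talk: it assumes each holder gossips to one uniformly random peer per round and that a gossip exchange is lost independently with probability `loss`.

```python
def expected_holders(n, initial, rounds, loss=0.0):
    """Mean-field estimate of how many of n processes hold a message.

    A process without the message stays without it only if none of the
    current holders picks it (and gets through) in this round.
    Returns the trace [holders after 0 rounds, after 1 round, ...].
    """
    holders = float(initial)
    trace = [holders]
    for _ in range(rounds):
        p_missed = (1.0 - (1.0 - loss) / (n - 1)) ** holders
        holders += (n - holders) * (1.0 - p_missed)
        trace.append(holders)
    return trace

if __name__ == "__main__":
    for b, h in enumerate(expected_holders(50, 1, 12)):
        print(b, round(50 - h, 6))  # expected number still missing the message
```

Once most processes hold the message, the expected number still missing shrinks by a roughly constant factor each round, which is the exponential convergence the slide refers to.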

  • Either sender fails or data gets through w.h.p.

    [Figure: Pbcast bimodal delivery distribution. x-axis: number of processes to deliver pbcast; y-axis: p{#processes=k}]

  • Failure analysis
    - Suppose someone tells me what they hope to avoid
    - Model as a predicate on final system state
    - Can compute the probability that pbcast would terminate in that state, again from the model

  • Two predicates
    - Predicate I: A faulty outcome is one where more than 10% but less than 90% of the processes get the multicast
    - Think of a probabilistic Byzantine Generals problem: a disaster if many but not most troops attack

  • Two predicates
    - Predicate II: A faulty outcome is one where roughly half get the multicast and failures might conceal true outcome
    - This would make sense if using pbcast to distribute quorum-style updates to replicated data. The costly, hence undesired, outcome is the one where we need to roll back because the outcome is uncertain

  • Two predicates
    - Predicate I: More than 10% but less than 90% of the processes get the multicast
    - Predicate II: Roughly half get the multicast but crash failures might conceal outcome
    - Easy to add your own predicate. Our methodology supports any predicate over final system state
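
Predicates over the final system state are easy to express as functions and to evaluate against a delivery distribution. In this sketch `predicate_i` follows the slide directly, while `predicate_ii` is one possible reading of "roughly half, with crashes concealing the outcome" (an observer cannot tell whether a majority received the message); that reading, the function names, and the example distribution are all assumptions.

```python
def predicate_i(delivered, n):
    """Faulty if more than 10% but less than 90% of n processes got the multicast.
    Integer arithmetic avoids floating-point edge cases at the boundaries."""
    return n < 10 * delivered < 9 * n

def predicate_ii(observed, n, crashed):
    """Faulty if crashes conceal whether a majority received the multicast.

    observed: deliveries visible among surviving processes
    crashed: processes whose state can no longer be inspected
    (Interpretation chosen for this sketch, not stated in the slides.)
    """
    majority = n // 2 + 1
    return observed < majority <= observed + crashed

def failure_probability(distribution, is_faulty):
    """Sum the probability of every final state the predicate marks as faulty.
    distribution: dict mapping number of deliveries k -> P{#processes = k}."""
    return sum(p for k, p in distribution.items() if is_faulty(k))

if __name__ == "__main__":
    # Made-up bimodal distribution for n = 50, for illustration only.
    dist = {0: 0.03, 25: 0.001, 50: 0.969}
    print(failure_probability(dist, lambda k: predicate_i(k, 50)))
```

Any other predicate with the same signature can be passed to `failure_probability` unchanged.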

  • [Figure: Scalability of Pbcast reliability. x-axis: #processes in system; y-axis: P{failure}; series: Predicate I, Predicate II]

  • [Figure: Effects of fanout on reliability. x-axis: fanout; y-axis: P{failure}; series: Predicate I, Predicate II]

  • [Figure: Fanout required for a specified reliability. x-axis: #processes in system; y-axis: fanout; series: Predicate I for 1E-8 reliability, Predicate II for 1E-12 reliability]

  • Discussion
    - We see that pbcast is indeed bimodal even in worst case, when initial multicast fails
    - Can easily tune parameters to obtain desired guarantees of reliability
    - Protocol is suitable for use in applications where bounded risk of undesired outcome is sufficient

  • Model makes assumptions...
    - These are rather simplistic
    - Yet the model seems to predict behavior in real networks, anyhow
    - In effect, the protocol is not merely robust to process perturbation and message loss, but also to perturbation of the model itself
    - Speculate that this is due to the incredible power of exponential convergence...

  • Experimental work
    - SP2 is a large network
    - Nodes are basically UNIX workstations
    - Interconnect is basically an ATM network
    - Software is standard Internet stack (TCP, UDP)
    - We obtained access to as many as 128 nodes on Cornell SP2 in Theory Center
    - Ran pbcast on this, and also ran a second implementation on a real network

  • Example of a question
    - Create a group of 8 members
    - Perturb one member in style of Figure 1
    - Now look at stability of throughput
    - Measure rate of received messages during periods of 100ms each
    - Plot histogram over life of experiment
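
The measurement step can be sketched as a function that buckets message arrival times into 100 ms periods and histograms the per-period counts. The function name and the input format (a list of arrival timestamps in milliseconds) are assumptions; the talk does not describe its measurement code.

```python
from collections import Counter

def throughput_histogram(arrival_times_ms, window_ms=100):
    """Return {messages received in a window: fraction of windows with that count}.

    Windows with no arrivals between the start and the last arrival
    are counted as zero-throughput periods.
    """
    if not arrival_times_ms:
        return {}
    per_window = Counter(int(t // window_ms) for t in arrival_times_ms)
    last_window = int(max(arrival_times_ms) // window_ms)
    counts = [per_window.get(w, 0) for w in range(last_window + 1)]
    hist = Counter(counts)
    return {rate: hist[rate] / len(counts) for rate in sorted(hist)}

if __name__ == "__main__":
    arrivals = [0, 10, 20, 150, 160, 170, 350]
    print(throughput_histogram(arrivals))
```

A stable protocol concentrates nearly all of the mass in one bin; a protocol disturbed by a slow member spreads mass toward the low-throughput bins.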

  • Source to dest latency distributions
    - Notice that in practice, bimodal multicast is fast!

    [Figure: Histogram of throughput for pbcast. x-axis: inter-arrival spacing (ms); y-axis: probability of occurrence; series: Pbcast with .05 sleep probability, Pbcast with .45 sleep probability. The embedded data sheet also lists series for a traditional protocol with .05 and .45 sleep probability.]

    0.0099740.0070970.007297

    0.0099940.0071120.007391

    0.0100220.0072970.007416

    0.010040.0073110.007757

    0.0100790.0074690.00821

    0.0101070.0074820.008302

    0.010240.0075290.008333

    0.010530.0079130.008349

    0.0109420.007930.0085

    0.0113920.007980.00876

    0.0115780.0082320.008958

    0.0116620.0086610.00903

    0.0116820.0087670.009121

    0.0118010.0092540.009175

    0.0119930.0101640.009281

    0.0137070.0104310.010412

    0.0138190.0110130.011389

    0.0140370.0127690.011573

    0.0144880.0129210.012047

    0.0566960.0135720.013404

    0.0139560.0143

    0.021560.018217

    0.0280.01829

    &A

    Page &P

    Histograms

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    &A

    Page &P

    Traditional Protocol with .05 sleep probability

    Traditional Protocol with .45 sleep probability

    Time to receive 100 messages

    Probability of occurence

    Histogram of throughput for Traditional Protocol

    g10.003fifo

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    00

    &A

    Page &P

    Pbcast with .05 sleep probability

    Pbcast with .45 sleep probability

    Time to receive 100 messages

    Probability of occurence

    Histogram of throughput for PBCast

    Figure: High-bandwidth comparison of pbcast performance at faulty and correct hosts. Series: throughput for the traditional protocol and for pbcast, each measured at a correct host and at a faulty host. x-axis: probability of sleep event; y-axis: average throughput.

    Figure: High-bandwidth measurements with varying numbers of sleepers. Series: traditional protocol and pbcast, each with 1, 3, and 5 sleepers. x-axis: probability of sleep event; y-axis: throughput measured at an unperturbed process.

    Figure: Low-bandwidth measurements with varying numbers of sleepers. Series: traditional protocol and pbcast, each with 1 and 5 sleepers. x-axis: probability of sleep event; y-axis: average throughput.

  • Now revisit Figure 1 in detail. Take 8 machines, perturb 1, pump data in at varying rates, and look at the rate of received messages.
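The shape of this experiment can be imitated with a toy model. The sketch below is not the Ensemble or pbcast code: it assumes a sender that must keep every message buffered until the perturbed member acknowledges it, and that stops sending when the buffer is full. The function name, the buffer size, and the per-step sleep model are all illustrative assumptions.

```python
import random

def healthy_throughput(sleep_prob, steps=1000, rate=10, buffer_cap=50, seed=0):
    """Average messages per step delivered to healthy members in a toy
    flow-controlled multicast with one perturbed (sleepy) member.

    Assumptions (not from the slides): the sender offers `rate` messages
    per step, holds each one until the perturbed member acks it, and may
    hold at most `buffer_cap` unacked messages.
    """
    rng = random.Random(seed)
    unacked = 0    # messages still waiting for the perturbed member's ack
    delivered = 0  # messages that reached the healthy members
    for _ in range(steps):
        # When awake, the perturbed member catches up and acks everything.
        if rng.random() >= sleep_prob:
            unacked = 0
        # Flow control: send only what still fits in the sender's buffer.
        sendable = min(rate, buffer_cap - unacked)
        unacked += sendable
        delivered += sendable
    return delivered / steps

if __name__ == "__main__":
    for p in (0.05, 0.25, 0.45, 0.65, 0.85):
        print(f"sleep probability {p:.2f}: {healthy_throughput(p):.2f} msgs/step")
```

In this model the healthy members see the full rate only while the sleepy member keeps up. A protocol that never blocks the sender on one slow receiver would correspond to dropping the `buffer_cap` limit.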

  • Revisit our original scenario with perturbations (32 processes)

  • Throughput variation as a function of scale

  • Impact of packet loss on reliability and retransmission rate. Notice that when the network becomes overloaded, healthy processes experience packet loss!

  • What about growth of overhead? Look at messages other than the original data distribution multicast. Measure the worst-case scenario: costs at the main generator of multicasts. Side remark: all of these graphs look identical with multiple senders or if overhead is measured elsewhere.

  • Growth of overhead? Clearly, overhead does grow. We know it will be bounded except for probabilistic phenomena. At peak, load is still fairly low.
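One way to see why background overhead can stay bounded is a toy gossip-repair model. This part of the deck does not spell out the mechanism, so the sketch assumes the usual description of pbcast: a best-effort multicast followed by rounds in which every process sends one digest to a randomly chosen peer, and a peer that lacks the message fetches it from the gossiper. All names and parameters are illustrative.

```python
import random

def gossip_repair(n=100, loss=0.5, max_rounds=50, seed=1):
    """Toy model: unreliable multicast, then gossip rounds to fill the gaps.

    Returns (rounds_used, digests_sent_per_round, everyone_has_message).
    """
    rng = random.Random(seed)
    # Phase 1: best-effort multicast. Process 0 is the sender and always has it.
    has_msg = [True] + [rng.random() >= loss for _ in range(n - 1)]
    digests_per_round = []
    rounds = 0
    # Phase 2: gossip until everyone has the message or we give up.
    while not all(has_msg) and rounds < max_rounds:
        rounds += 1
        repaired = []
        for p in range(n):
            # Pick a peer uniformly among the other n - 1 processes.
            peer = rng.randrange(n - 1)
            if peer >= p:
                peer += 1
            # The peer compares the digest with its state and pulls what it lacks.
            if has_msg[p] and not has_msg[peer]:
                repaired.append(peer)
        for q in repaired:
            has_msg[q] = True
        digests_per_round.append(n)  # exactly one digest per process per round
    return rounds, digests_per_round, all(has_msg)

if __name__ == "__main__":
    rounds, digests, done = gossip_repair()
    print(f"repaired: {done}, rounds: {rounds}, digests per round: {digests[:3]}")
```

In this model each process sends exactly one digest per round whatever the group size, so the per-process background load is constant. Only the repair traffic varies with how much was lost.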

  • Pbcast versus SRM, 0.1% packet loss rate on all links. Panels: tree networks; star networks.

  • Pbcast versus SRM: link utilization

  • Pbcast versus SRM: 300 members on a 1000-node tree, 0.1% packet loss rate

  • Pbcast Versus SRM: Interarrival Spacing

  • Pbcast versus SRM: Interarrival spacing (500 nodes, 300 members, 1.0% packet loss)

  • Real data: Spinglass on a 10 Mbit Ethernet (35 Ultrasparcs). Panels: injected noise, retransmission limit disabled; injected noise, retransmission limit re-enabled. Time axis in seconds.

  • Networks structured as clusters

  • Delivery latency in a 2-cluster LAN, 50% noise between clusters, 1% elsewhere

  • Requests/repairs and latencies with bounded router bandwidth

  • Discussion. Saw that the stability of the protocol is exceptional even under heavy perturbation. Overhead is low and stays low with system size, bounded even for heavy perturbation. Throughput is extremely steady. In contrast, virtual synchrony and SRM are both fragile under this sort of attack.

  • Programming with pbcast? Most often one would want to split the application into multiple subsystems. Use pbcast for subsystems that generate a regular flow of data and can tolerate infrequent loss if the risk is bounded. Use stronger properties for subsystems with less load that need high availability and consistency at all times.

  • Programming with pbcast? In a stock exchange, use pbcast for pricing but abcast for control operations. In a hospital, use pbcast for telemetry data but abcast when changing medication. In an air traffic system, use pbcast for routine radar track updates but abcast when a pilot registers a flight plan change.
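The split described on these two slides can be sketched as a small routing table. The message-kind names below are hypothetical labels for the slide examples, and `choose_protocol` is an illustrative helper, not part of any real pbcast or abcast API. Sending unknown kinds over abcast is a design assumption of this sketch: when in doubt, prefer the stronger guarantee.

```python
from enum import Enum

class Protocol(Enum):
    PBCAST = "pbcast"  # steady, high-rate data; rare bounded loss is tolerable
    ABCAST = "abcast"  # lower-rate actions needing consistency at all times

# Hypothetical message kinds, one pair per example on the slide.
ROUTES = {
    "stock_price_update": Protocol.PBCAST,
    "stock_control_operation": Protocol.ABCAST,
    "patient_telemetry": Protocol.PBCAST,
    "medication_change": Protocol.ABCAST,
    "radar_track_update": Protocol.PBCAST,
    "flight_plan_change": Protocol.ABCAST,
}

def choose_protocol(kind: str) -> Protocol:
    """Pick the multicast protocol for a message kind.

    Unknown kinds fall back to abcast (an assumption of this sketch).
    """
    return ROUTES.get(kind, Protocol.ABCAST)

if __name__ == "__main__":
    for kind in ("radar_track_update", "flight_plan_change", "unlisted_kind"):
        print(kind, "->", choose_protocol(kind).value)
```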

  • Our vision: one protocol side-by-side with the other. Use virtual synchrony for replicated data and control actions, where strong guarantees are needed for safety. Use pbcast for high data rates and steady flows of information, where longer-term properties are critical but any individual multicast is of less critical importance.

  • Summary. New data point in a familiar spectrum: virtual synchrony, bimodal probabilistic multicast, scalable reliable multicast. Demonstrated that pbcast is suitable for analytic work. Saw that it has exceptional stability.