

Dual-leader Master Election for Distributed Systems (Obiden)

Jeremy Sorensen Allan Xiao David Allender

[email protected] [email protected] [email protected]


Introduction

Objective To design and implement the Obiden algorithm, a variation of the Raft distributed consensus algorithm that uses a president and a vice-president to provide better throughput in large clusters.

Problem The distributed consensus problem arises when a cluster of servers must maintain replicated state across all nodes in order to provide safety in the presence of server failures or communication failures. It consists of ensuring that the correct state is recorded on all servers in the correct order and that no server commits state that differs from any other server.

Relation to class subject matter This project deals with the design and implementation of a novel algorithm for solving the distributed consensus problem.

Problems with existing approaches The Paxos algorithm [1,8] has long been the de facto standard for distributed consensus; however, the algorithm as presented by the author has two problems. First, it is very difficult to understand (for many people, not just us); second, it leaves so many details out that every practical implementation has been a significant departure from the actual algorithm. The Raft algorithm was created as an alternative to Paxos for distributed consensus. It was intended to be easier to understand and implement than Paxos while maintaining similar performance, and by making use of a strong leader it was largely successful. The authors of the Raft algorithm did note, however, that as the number of nodes in the cluster increases, throughput can suffer due to the single leader becoming a communication bottleneck. [2]

Improvement claim The Obiden algorithm accepts a marginal increase in complexity in order to improve throughput in large clusters. By splitting the work of contacting all servers between a president and a vice-president we can increase throughput. The vice-president is appointed by the president, and succession is well defined in the case of the loss of either the president or the vice-president. Thus the Raft election process is essentially unchanged, and the sending of messages is only slightly more complicated. Other aspects of the algorithm are largely unaffected.

Problem statement Development of the Obiden distributed consensus algorithm.

Scope of investigation The election process of the Raft algorithm will be extended in four ways:


1. The normal election process will have an additional appointment step in which the president appoints a vice-president once it is elected.

2. In the case that the vice-president fails or loses communication, the president will appoint a new vice-president.

3. To prevent a failed vice-president from causing a new election, when hosts time out they send a new message to the president requesting an AppendEntries message and reset their timer. If they receive a new AppendEntries from the president or a new vice-president, the election is averted.

4. The vice-president will have a minimum election timeout; in the case that the president fails or loses communication, the new president will be elected by the normal Raft election process, but the vice-president will be significantly more likely to become president.

After the election phase, the sending of data/heartbeat messages will be altered so that the president sends the message to half of the cluster, including the vice-president, and the vice-president then forwards the message to the other half of the cluster. For the purposes of this paper the rest of the algorithm, for example cluster membership changes and log compaction [2], will be largely ignored. Our implementation will be able to send data messages and responses both in "normal" (simplified) Raft mode and in Obiden mode. Parameters such as message data size, and possibly artificial delays, will be used to approximate the performance of the actual Raft algorithm as implemented and profiled in the original Raft paper. Our implementation will then be run in Obiden mode under the same conditions to measure any improvement in throughput. The algorithm will be implemented in C++ or Python, and several (between 3 and 11) processes will communicate with each other using sockets.
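To make the message split concrete, the following Python sketch shows how the recipient set of one AppendEntries round could be divided between the two leaders. The function name and the data shapes are illustrative assumptions, not taken from our implementation; note the fallback to plain Raft when no vice-president is appointed.

```python
def plan_append_entries_fanout(host_ids, president_id, vp_id):
    """Return {sender_id: [recipient_ids]} for one AppendEntries round.

    Illustrative helper: with no vice-president the president contacts
    everyone (plain Raft); otherwise the remaining hosts are split in
    half between president and vice-president.
    """
    others = [h for h in host_ids if h != president_id]
    if vp_id is None:
        return {president_id: others}            # reduces to Raft
    rest = [h for h in others if h != vp_id]
    half = len(rest) // 2
    return {
        president_id: [vp_id] + rest[:half],     # half the cluster, incl. the VP
        vp_id: rest[half:],                      # VP forwards to the other half
    }

# Example: an 11-host cluster with host 0 as president and host 1 as VP.
print(plan_append_entries_fanout(list(range(11)), president_id=0, vp_id=1))
```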

Theoretical Basis and Literature Review

Distributed Consensus In a client-server environment where the state of the server is critically important, replication across multiple servers is often used to increase safety. Unfortunately, if the replicated hosts are located very close together they are not immune to power outages, natural disasters, or other events that could disable or destroy all the replicas in a single instant. Thus the servers should be spread out and joined by interconnects. Unfortunately, these interconnects introduce their own problems; messages can now be lost, corrupted, delayed, or repeated. Instead of merely lost information there is now the possibility of incorrect information. To combat this problem, significant research [3-6] has been done on how best to establish "consensus," that is, a guarantee that if any host commits its state at any step, no other host commits a different state at that same step.

Paxos Arguably the most well known algorithm for achieving consensus is Paxos. [1] A state is proposed by one or more host processes (acting as "proposers") and voted on by the other hosts (acting as "acceptors"). This actually happens twice, in what are called the prepare phase and the accept phase. If a proposal is accepted by a majority of hosts in the accept phase, the accepted proposal is sent to all hosts and committed. Paxos in its original form confines itself to the problem of achieving consensus on a single state, and suggests using a separate instance of Paxos for each additional state that needs to be committed. Monotonically increasing ballot numbers are used to ensure that only the latest proposal is accepted by the majority of voters. Algorithms for recording an entire log of state are called Multi-Paxos algorithms. Even the single-state version has caused many researchers difficulties in implementation, both because the algorithm is difficult to understand and because, as described, it does not sufficiently specify all the aspects needed to create a practical system. Implementers are left to write code that is not based on any part of the algorithm and therefore not proven correct.
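The two-phase structure can be summarized with a minimal single-decree acceptor sketch in Python. This is only an outline of the ballot-number rules described above; field and method names are our own illustrative choices, and real implementations track considerably more state.

```python
class Acceptor:
    """Minimal single-decree Paxos acceptor (illustrative sketch)."""

    def __init__(self):
        self.promised = -1         # highest ballot number promised so far
        self.accepted_ballot = -1  # ballot of the last accepted proposal
        self.accepted_value = None

    def on_prepare(self, ballot):
        """Prepare phase: promise to ignore lower-numbered ballots."""
        if ballot > self.promised:
            self.promised = ballot
            # Report any previously accepted proposal so the proposer must
            # re-propose that value instead of introducing a new one.
            return ("promise", self.accepted_ballot, self.accepted_value)
        return ("reject", self.promised, None)

    def on_accept(self, ballot, value):
        """Accept phase: accept unless a higher ballot has been promised."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted_ballot = ballot
            self.accepted_value = value
            return ("accepted", ballot)
        return ("reject", self.promised)
```

A value is considered committed once a majority of acceptors have accepted the same ballot, which is what the monotonically increasing ballot numbers guarantee.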

Raft The authors of the Raft paper found that many people, even those who had done significant work with Paxos, found it difficult to understand. They wanted to make an algorithm that would be easier to understand and implement [2], and part of their design criteria was to maximize understandability whenever possible. The two most significant changes they made were, first, designing the algorithm around creating a state log rather than a single state, and second, using a strong-leader approach. Paxos uses a weak-leader approach: any node can make a proposal and all nodes vote on whether to accept it. This allows the algorithm to keep working even if half (minus one) of the nodes fail. With Raft there is one leader who dictates the states, and all other hosts commit the state that the leader dictates. This seems like it would cause a problem, since the leader is a single point of failure. In fact, rather than being continuously immune to host failures, Raft succeeds by having a normal state and a recovery state. Normally the leader dictates to the other hosts. In the event of a leader failure, a new leader is elected and the process recovers. This allows normal operation to be greatly simplified and relegates the consensus part of the algorithm to a single, perhaps costly, but hopefully rare event. Because there is a single leader in Raft, it is very important that every host know who the leader is. When the leader sends out data messages it is clear who the leader is: it is the host sending the messages. When no data is being sent the situation is less clear. Raft tries to be easy to understand, so the solution the authors chose is very simple: send data messages even if there is no data to send. The authors call this the "heartbeat." As long as hosts keep getting the heartbeat, there is a leader. If it stops, an election process is started to determine a new leader. The strong-leader approach has been very successful; the number of Raft implementations being used in real systems is quite large. [7] Having a strong leader simplifies the algorithm both conceptually and in terms of implementation. The election process itself is also much simpler than the Paxos proposal process. In Raft, when an election is needed, all hosts randomly generate a time interval and wait for it to expire. If in that time they receive a request from another node they vote for it. Otherwise they send out a request themselves. The host (if any) that gets a majority of votes becomes the leader. In the case of a split vote, the election is simply carried out again. There is very little downside to the Raft algorithm, and in most situations it seems to work well. The Raft paper does, however, point out a potential limitation: the performance of the algorithm drops off as the single leader has to send more data to larger clusters. Because the single leader is responsible for updating all the hosts, it can become a bottleneck.
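The randomized election timeout at the heart of this process can be sketched in a few lines of Python. The queue-based receive helper is our own stand-in for the network layer, and the timeout range is taken from the timeout matrix later in this paper.

```python
import queue
import random

ELECTION_TIMEOUT = (0.150, 0.300)   # seconds, matching this paper's timeout matrix

def follower_wait(inbox: "queue.Queue"):
    """Block for one randomized election timeout. Returns the message that
    reset the timer, or None if the node should become a candidate.
    `inbox` is an illustrative stand-in for the node's receive queue."""
    timeout = random.uniform(*ELECTION_TIMEOUT)
    try:
        return inbox.get(timeout=timeout)   # heartbeat or vote request arrived
    except queue.Empty:
        return None                         # no leader heard from: start an election
```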

Obiden Our algorithm attempts to address the single-leader bottleneck with a single simple change: allow one host to help the leader communicate with the other hosts. The leader is now called the president, and the helper is called the vice-president. There are several implications of this change.

1. The heartbeat and data have to be propagated through the vice-president fast enough to prevent an election from starting.

2. If the vice-president fails, there needs to be a mechanism to prevent the hosts that stopped receiving the propagated heartbeat from attempting to elect another president. If the hosts know who the president is, they can send a message to the president, which can send a heartbeat (or data) response to prevent the election until a new vice-president is appointed (see the sketch after this list).

3. If the vice-president is in some way closer to the president (for instance, has lower latency or a more robust connection), it should help with performance or availability.

4. Due to the previous point and the fact that the president communicates through it, it is reasonable to assume that the vice-president is likely to be close to "caught up" with the president. This makes it a good choice for president should the current one fail. Rather than try to guarantee this, we give the vice-president an advantage during the election (a minimum election timeout) and then let the normal Raft process determine the new president.

5. If the vice-president fails or a message to the vice-president is lost, it must be ensured that the hosts that receive messages from the vice-president do not attempt to elect a new president. Also, the lost message(s) must be resent directly from the president until a new vice-president is appointed.
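The election-avoidance mechanism from implication 2 can be sketched as follows. All names here are hypothetical, not taken from our implementation; the point is that a first timeout asks the president for a direct AppendEntries and only a second consecutive timeout triggers a normal election.

```python
from dataclasses import dataclass, field

@dataclass
class FollowerTimers:
    """Illustrative follower-side state for Obiden's election avoidance."""
    asked_president: bool = False   # the flag from the follower-state flow
    outbox: list = field(default_factory=list)
    state: str = "follower"

def on_timeout(f: FollowerTimers, president_id: int):
    if not f.asked_president:
        f.asked_president = True
        f.outbox.append((president_id, "RequestAppendEntries"))  # ask directly
        # ...and reset the election timer here.
    else:
        f.state = "candidate"       # second timeout in a row: normal election

def on_append_entries(f: FollowerTimers):
    f.asked_president = False       # heartbeat arrived; the election is averted
```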

The Obiden algorithm attempts to make as few changes to Raft as possible while still dealing with the above ramifications of adding the vice-president role. It represents a weakening of the strong leader, but not in a way that has been done in other research. In other weak-leader algorithms, decision-making authority over the state is delegated to the hosts; in Obiden it stays with the president. The vice-president simply helps to get the word out to the other hosts. We believe this makes the Obiden algorithm unique. We believe our algorithm is nearly as simple as the Raft algorithm; in fact it can be compatible with the Raft algorithm. In the case of a small number of servers an Obiden implementation should reduce to a Raft implementation, since the leader bottleneck won't exist. More importantly, to ensure correctness, in the case that the president does not have a currently appointed vice-president, it sends messages to all hosts, thus reducing to the Raft algorithm. For larger clusters, however, the Obiden algorithm should allow higher throughput by sharing the load of communicating with all of the hosts with the vice-president.

Goals Rather than completely implement Raft and Obiden, our aim is to create a partial implementation that demonstrates the election process and failure recovery of Raft, as well as representative throughput. We will emulate the context needed to produce metrics comparable to those of the LogCabin implementation of Raft as recorded in the Raft paper. Then we will introduce the changes needed to turn a Raft implementation into an Obiden implementation. We will examine the election process with the addition of the vice-presidential appointment, and measure the throughput of our algorithm in the same context and in the same way to show the change in throughput. We believe that the Obiden algorithm will improve throughput in clusters with more than 5 hosts. We hope to run up to 11 hosts and measure throughput improvements.

Methodology

How to generate and collect input data In the context of consensus algorithms, input data can mean two things. The first would be the input from clients, which drives the state machine and is recorded in the state log replicated across the hosts. The second is the parameters and perturbations that affect the performance of the algorithm. These consist of the frequency and size of the messages; message failures, delays, and duplications; and host failures and restarts. This second kind of "input" is what is of interest in our tests. As such, the actual message content will be arbitrary; in fact, given a message size, the same data will be used in all messages. This will reduce the effect of host data generation when measuring message throughput. The other "input" will consist of a series of tabulated test conditions and failure events. These may be partially randomly generated, but not on the fly, since every test will be repeated for the Raft and Obiden implementations for comparison.
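A table of test conditions of this kind could be pre-generated as in the following sketch. The specific parameter values are illustrative assumptions, not our actual test plan; the fixed seed reflects the requirement above that conditions are generated ahead of time so the same table can be replayed in both Raft and Obiden modes.

```python
import itertools
import random

random.seed(42)  # pre-generated, not on the fly, so both modes see identical runs

cluster_sizes = [3, 5, 7, 9, 11]
message_sizes = [64, 1024, 8192]          # bytes (illustrative values)
drop_rates    = [0.0, 0.01, 0.05]         # fraction of messages lost

test_conditions = [
    {"hosts": n, "msg_bytes": s, "drop_rate": d,
     "host_failure_at": random.choice([None, 2.0, 5.0])}  # seconds into the run
    for n, s, d in itertools.product(cluster_sizes, message_sizes, drop_rates)
]
```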

How to solve the problem

Algorithm design


Follower State

This flow is very similar to the Raft flow with one basic change. Since a follower may not get a message due to a failure of the vice-president, if it times out it sets a flag and then sends a special RequestAppendEntries message to the president directly. In response the president will send a duplicate AppendEntries, but not through the vice-president. In this case the timer is reset and flow continues in the follower state. If the follower times out and the flag is already set, it moves to the candidate state just as in Raft. The flow to handle the AppendEntries message is different for Obiden than for Raft. Handling AppendEntries is the same as for Raft if the host is not the vice-president. However, at the end of normal processing the host checks whether the vpId is its own ID. In that case the host is the vice-president and must do extra processing. It forwards the AppendEntries message to its hosts and then waits for responses from them, up to a 300 millisecond timeout. When hosts respond, the vice-president records that they responded as well as their response (success, term). Once all hosts respond, or the 300 millisecond timeout occurs, the vice-president takes the result and responds to the original AppendEntries message, passing the information to the president.
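The vice-president's aggregation step can be sketched as below. The queue is an illustrative stand-in for the network receive path; only the deadline logic and the (success, term) bookkeeping described above are intended to be faithful.

```python
import queue
import time

VP_RESPONSE_TIMEOUT = 0.300   # seconds: the aggregation window described above

def collect_group_responses(responses: "queue.Queue", group_size: int):
    """Drain (host_id, success, term) tuples until every host in the VP's
    group has answered or 300 ms elapse; return the combined result the VP
    would pass back to the president in its AppendEntries response."""
    deadline = time.monotonic() + VP_RESPONSE_TIMEOUT
    combined = {}
    while len(combined) < group_size:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break                           # timeout: report what we have
        try:
            host_id, success, term = responses.get(timeout=remaining)
            combined[host_id] = (success, term)
        except queue.Empty:
            break
    return combined
```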


Candidate State

The candidate state works exactly the same way for Raft and Obiden. The term is incremented, vote requests are sent to all hosts, and if a majority of votes is received the candidate becomes president. If a majority is not received, the candidate does not automatically become a follower; an AppendEntries must be received from the president for it to become a follower. Otherwise a new election is started for a new term.
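A minimal vote-counting helper makes the majority rule explicit. The tuple shape of the responses is an illustrative assumption rather than our packet format.

```python
def tally_votes(responses, cluster_size, my_term):
    """Count RequestVoteResponse packets, given as (term, vote_granted)
    tuples (illustrative shape). Returns True when the candidate has a
    majority and should become president."""
    votes = 1   # the candidate votes for itself
    for term, granted in responses:
        if term == my_term and granted:
            votes += 1
    return votes > cluster_size // 2
```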


President State

The president state is mostly equivalent to the leader state in Raft. The president sends AppendEntries messages, either with data or empty, to prevent a new election. As the hosts respond, the president updates the indices for each host, including handling log inconsistencies and old terms. If there is no vice-president this happens as in Raft, except that the first host to respond will be appointed vice-president. When there is a vice-president, the president does not send AppendEntries to all hosts; instead it sends to half of the hosts, including the vice-president, and the vice-president sends to the other half. One Obiden addition is the special RequestAppendEntries message, which hosts send if they do not receive an AppendEntries; this gives the president one chance to resend AppendEntries to the host directly and avoid losing the presidency.
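The president-side bookkeeping can be sketched as follows. The dictionary layout and parameter names are hypothetical; the index handling follows the usual Raft-style back-up-and-retry on inconsistency, plus the Obiden twist that the first responder is appointed vice-president when none is currently appointed.

```python
def on_append_entries_response(pres, host_id, success, term, sent_up_to):
    """Update per-host indices as described above (illustrative sketch)."""
    if term > pres["current_term"]:
        pres["state"] = "follower"       # a newer term exists: step down
        pres["current_term"] = term
        return
    if success:
        pres["match_index"][host_id] = sent_up_to
        pres["next_index"][host_id] = sent_up_to + 1
    else:
        # Log inconsistency: back up one entry and retry on the next round.
        pres["next_index"][host_id] = max(1, pres["next_index"][host_id] - 1)
    if pres.get("vp_id") is None:
        pres["vp_id"] = host_id          # first host to respond becomes VP
```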


Vice President

A new Vice President is chosen for every AppendEntries request/response cycle. The Vice President is chosen by the President once the President finishes processing the AppendEntriesResponses collected from all the hosts for the previous round. The Vice President is a member of the largest group of hosts whose nextIndex values are identical. The Vice President then receives the AppendEntries from the President and forwards these packets to its own group of hosts. The moment it sends out these packets, the VP starts a VP timer and begins waiting for responses from the hosts. As the VP collects the responses, once it has received responses from all the hosts in its group or the VP timer runs out, whichever happens first, it forwards the cumulative response of its own group of hosts to the President. The VP then reverts to the follower state, waiting to be chosen in the next AppendEntries cycle.
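The round-by-round choice can be sketched by grouping hosts on their nextIndex value. This sketch assumes, per the description above, that the VP is the first member of the largest group and that the rest of that group becomes the VP's followers; function and variable names are our own.

```python
from collections import defaultdict

def choose_vice_president(next_index: dict, president_id):
    """Group hosts by nextIndex and pick the first member of the largest
    group as Vice President; the rest of that group becomes the VP's
    group of followers (illustrative sketch)."""
    groups = defaultdict(list)
    for host_id, idx in sorted(next_index.items()):
        if host_id != president_id:
            groups[idx].append(host_id)
    largest = max(groups.values(), key=len)
    return largest[0], largest[1:]       # (vp_id, vp_group)

# Example: hosts 1-4 share nextIndex 7, so host 1 becomes VP of hosts 2-4.
print(choose_vice_president({1: 7, 2: 7, 3: 7, 4: 7, 5: 3}, president_id=0))
```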

Language used The reference Raft implementation, LogCabin, was written in C++, and Raft has since been ported to many other languages. The most widely used implementation is the one in etcd, which is written in Go. For this particular project, C++ or Python will be the primary language used.

Tools used In order to run a successful simulation, a network of computers or servers will be needed. Since the primary goal is to measure bandwidth and latency within a network when implementing this new master-election algorithm, there will need to be a networking device, a number of computers to participate in the cluster, and one additional computer to monitor network traffic. Besides this hardware, the team will use various IDEs for developing the Obiden algorithm.

How to generate output In order to generate the output, we will run an instance of the program on each of the available machines on the network. With large packet sizes, bandwidth consumption will increase and will need to be monitored by a non-clustered machine. This non-clustered machine will track latency to an out-of-network location and determine whether the dual-leader system reduces network congestion.

How to test against hypothesis As stated in the previous section, the output will be benchmarked against the single-leader Raft implementation and performance metrics will be measured. The network latency should be much lower with the dual-leader election than with the single-leader Raft-based implementation.

Implementation The code written for this project can be found at the GitHub link located in the appendix at the end of the document.

Assumptions

No Network Partitions Although dropped packets are common with UDP, we will not be accounting for network partitions within this algorithm.

Static Node Allocation Only a fixed number of nodes can participate in the Obiden algorithm at once, with the same maximum as was tested in Raft. No more than 11 nodes will participate, and they will be pre-defined in the config file as stated in the section below. Nodes that are not in the config file cannot join an existing cluster; the config file must be changed and the services restarted for a new node to be accepted.

Code Since certain assumptions were made for this algorithm that were described in the Assumptions section, the code may look somewhat similar to a stripped-down version of the Raft algorithm written in Go.

Before Running (Initialization) A host network info configuration file must be created by the network admin that contains the IP addresses and port information for the nodes participating in the algorithm. This file is parsed by main and used for network setup and communication. The second command line argument that must be passed in is the local node's internal IP address, followed by the port, in this format: 123.45.67.89:1234. Main uses this information to look up in host_info which node is the local host, in order to set up the networking configuration. The last entry in the host configuration file is always the client.
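An illustrative parser for such a file is sketched below. It assumes one "ip:port" entry per line with the client as the last entry, as described above; the actual file format and helper names in the repository may differ.

```python
import sys

def load_hosts(path):
    """Parse the host network info file: one 'ip:port' per line, client last.
    Returns (cluster_nodes, client) as (ip, port) tuples (illustrative)."""
    with open(path) as f:
        entries = [line.strip() for line in f if line.strip()]
    hosts = [(ip, int(port)) for ip, port in (e.rsplit(":", 1) for e in entries)]
    return hosts[:-1], hosts[-1]

if __name__ == "__main__":
    config_path, local_addr = sys.argv[1], sys.argv[2]   # e.g. "123.45.67.89:1234"
    nodes, client = load_hosts(config_path)
    ip, port = local_addr.rsplit(":", 1)
    local_index = nodes.index((ip, int(port)))           # which node this process is
```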

Main The program starts by parsing the necessary information to create a Network class instance and starting a listening thread used for network communication. This thread receives all incoming packets for that particular node. A timer thread is also created to signal when the node has not received any messages within its timeout, so that it can take the action appropriate to its state (for a follower, sending RequestAppendEntries or starting an election). From here, the code diverges into independent flows depending on the role of the node. There are four states that each node can turn into; the initial node state is the "follower" state.

President The role of the president is to figure out which hosts are out of date and group them by the index at which they are located. It takes the largest group of hosts and checks whether it has at least three members. It chooses a member, makes it the Vice President, and sends an AppendEntries with that host's index as vp_index. Once it does that, it sends to all the other groups one at a time.


Vice President The Vice President shares the workload of the President. The Vice President is chosen if it is one of the hosts (followers) in the largest group (based on the responses from the hosts in the previous AppendEntries cycle). The Vice President receives the AppendEntries packet from the President and forwards it to the follower hosts under its control. The Vice President has its own timer, which is shorter than the election timer. The Vice President collects the responses from its own group of followers and sends their combined response to the President when its timer times out. It then reverts to the follower state. A new Vice President is chosen every time a non-empty AppendEntries is sent out by the President. The first host of the group with the largest number of identical last_log_index values is chosen as the Vice President. Once the Vice President is chosen, it is assigned its own group of hosts (followers).

Candidate A follower enters the candidate state if its election timer times out. This can happen when the president fails to send the periodic heartbeat packet (an empty AppendEntries packet) or the vice president fails to deliver the packet to the host (follower). The candidate sends RequestVote packets to the other hosts and checks whether it has a majority of votes based on the RequestVoteResponse packets from those hosts. If it does, it becomes the President and informs all the other hosts through empty AppendEntries packets.

Follower The only role of this state is to switch to the candidate state. Most of the follower's actions are done in the handlers. It also sends the RequestAppendEntries packet.


Design Documents & Flowcharts

Obiden Mode (base case)

Raft Mode (base case)


Obiden Mode (full house)

Timeout Matrix

Name                               Timeout Time
Heartbeat timeout                  100 ms
Election timeout                   150-300 ms (randomized)
Vice President combined response   50 ms


Data Analysis & Discussion

Output Generation To test the output that we had originally hypothesized, a separate client will be needed to exercise the cluster and determine whether the Obiden algorithm is more efficient in terms of bandwidth allocation than Raft.

Output Analysis Significant effort went into developing a codebase to test the hypothesis. It was planned that client log entries would be used as the metric to quantify the difference in performance between basic Raft and Obiden: the number of log entries per unit of time would be the principal metric, with more entries corresponding to better performance. Unfortunately, the code was not functioning as of the project due date, and thus the performance metrics were not taken.

Abnormal case explanation No abnormal cases were discovered, due to the non-functioning code.

Statistical regression No regression analysis of the performance data was completed.

Conclusions & Recommendations

Summary and conclusions We found that although Raft is not an overly complex algorithm, and Obiden is only marginally more difficult, within the time constraints of the project we were unable to get the code functioning. However, significant progress was made, and many important details of the algorithm became evident during the process. The distinction between a "best" case, where all nodes are up to date, and a "worst" case, where each node is in a different state, and how that affects the ability of the vice-president to alleviate bandwidth demands, was an important detail that we did not appreciate until after the implementation was started. We were able to conceptualize more fully the consequences of the Obiden modifications, and as a result we believe that the Obiden algorithm could lead to a performance increase in certain circumstances.

Recommendations for future studies Although we were unable to determine whether Obiden improves throughput, some features implemented in the original Raft were left out of the Obiden algorithm. In further work, the rest of the original Raft features would be implemented in this project, while relaxing some of the initial assumptions.


Another piece of additional work would be completing the algorithm so that it fully implements the Raft protocol with a secondary leader.

Bibliography
[1] Lamport, L. Paxos made simple. ACM SIGACT News 32, 4 (2001), 18-25.
[2] Ongaro, D., and Ousterhout, J. In search of an understandable consensus algorithm. In Proc. ATC'14, USENIX Annual Technical Conference (2014), USENIX.
[3] Baker, J., Bond, C., Corbett, J., Furman, J. J., Khorlin, A., et al. Megastore: Providing scalable, highly available storage for interactive services. In CIDR, volume 11, pages 223-234, 2011.
[4] Burrows, M. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th Symposium on Operating Systems Design and Implementation, pages 335-350. USENIX Association, 2006.
[5] Hunt, P., Konar, M., Junqueira, F. P., and Reed, B. ZooKeeper: wait-free coordination for internet-scale systems. In Proceedings of the 2010 USENIX Annual Technical Conference, volume 8, pages 11-11, 2010.
[6] Isard, M. Autopilot: automatic data center management. ACM SIGOPS Operating Systems Review, 41(2):60-67, 2007.
[7] Raft consensus algorithm website. http://raftconsensus.github.io.
[8] Lamport, L. The part-time parliament. ACM Trans. Comput. Syst. 16, 2 (1998), 133-169.

Appendices

Program Flowchart Refer to the Methodology section for program flowcharts and diagrams.

Program source code with documentation Source code can be found in our GitHub repository: https://github.com/allenderd/obiden
