Raft is a consensus algorithm for managing a replicated log. It decomposes the consensus problem into three relatively independent subproblems: leader election, log replication, and safety.
Raft divides time into terms of arbitrary length, and each term begins with an election, in which one or more servers attempt to become the leader. At any given time each server is in one of three states: leader, follower, or candidate.
In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader.
If a candidate wins the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.
Communication between servers happens using remote procedure calls (RPCs), and Raft requires only two RPC types to drive consensus: RequestVote, issued by candidates during elections, and AppendEntries, issued by leaders to replicate log entries and as a heartbeat.
The first step is to elect a distinguished leader, then give it complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines.
Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server’s current term is smaller than the other’s, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.
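As a rough sketch of the states and the term rule above (all type and field names here are my own, not from any particular implementation), the bookkeeping might look like this in Go:

```go
// Illustrative only: minimal state and term bookkeeping for one server.
type State int

const (
	Follower State = iota
	Candidate
	Leader
)

type Server struct {
	state       State
	currentTerm int
	votedFor    int // candidate ID voted for in currentTerm; -1 if none
}

// checkTerm applies the rule above to the term carried by any incoming RPC:
// a larger term updates ours and demotes us to follower; a smaller (stale)
// term means the request should be rejected.
func (s *Server) checkTerm(rpcTerm int) (stale bool) {
	if rpcTerm > s.currentTerm {
		s.currentTerm = rpcTerm
		s.state = Follower
		s.votedFor = -1
	}
	return rpcTerm < s.currentTerm
}
```

The later sketches in this post reuse this Server type.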
Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server remains in follower state as long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats (AppendEntries RPCs that carry no log entries) to all followers in order to maintain their authority. If a follower receives no communication over a period of time called the election timeout, then it assumes there is no viable leader and begins an election to choose a new leader: it increments its current term and transitions to candidate state. (A key difference of Raft's election is that only a candidate whose log is sufficiently up-to-date can win; more on this restriction below.)
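A hedged sketch of that follower loop, reusing the Server type from above (validRPC is a stand-in channel, assumed to be signaled whenever a valid leader or candidate RPC arrives):

```go
import "time"

// runFollower: any valid RPC from a leader or candidate resets the timer;
// if the election timeout elapses first, start an election.
func (s *Server) runFollower(validRPC <-chan struct{}, timeout time.Duration) {
	timer := time.NewTimer(timeout)
	defer timer.Stop()
	for {
		select {
		case <-validRPC:
			timer.Reset(timeout)
		case <-timer.C: // no viable leader
			s.currentTerm++
			s.state = Candidate
			return // it will now vote for itself and request votes from peers
		}
	}
}
```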
A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. Each server will vote for at most one candidate in a given term, on a first-come-first-served basis.
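Continuing the sketch, a vote handler enforcing one vote per term might look like this (the log up-to-date check, covered later in this post, is omitted here):

```go
type VoteRequest struct {
	Term        int
	CandidateID int
}

// handleRequestVote grants at most one vote per term, first come first served.
// checkTerm (defined earlier) resets votedFor whenever a newer term arrives.
func (s *Server) handleRequestVote(req VoteRequest) (granted bool) {
	if s.checkTerm(req.Term) {
		return false // stale term: reject outright
	}
	if s.votedFor == -1 || s.votedFor == req.CandidateID {
		s.votedFor = req.CandidateID // first request in this term wins the vote
		return true
	}
	return false
}
```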
While waiting for votes, a candidate may receive an AppendEntries RPC from another server claiming to be leader. If the leader's term (included in its RPC) is at least as large as the candidate's current term, then the candidate recognizes the leader as legitimate and returns to follower state. If the term in the RPC is smaller than the candidate's current term, then the candidate rejects the RPC and continues in candidate state.
The third possible outcome is that a candidate neither wins nor loses the election: if many followers become candidates at the same time, votes could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a new election by incrementing its term and initiating another round of RequestVote RPCs. Raft uses randomized election timeouts to ensure that split votes are rare and that they are resolved quickly.
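For example, each server might draw its timeout uniformly from the 150–300 ms range the Raft paper suggests:

```go
import (
	"math/rand"
	"time"
)

// randomElectionTimeout spreads candidacies out so that a split vote in
// one round is unlikely to repeat in the next.
func randomElectionTimeout() time.Duration {
	return time.Duration(150+rand.Intn(150)) * time.Millisecond
}
```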
Once a leader has been elected, it begins servicing client requests; each request contains a command to be executed by the replicated state machines. The leader appends the command to its log as a new entry, then issues AppendEntries RPCs in parallel to each of the other servers to replicate the entry. When the entry has been safely replicated, the leader applies the entry to its state machine and returns the result of that execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries AppendEntries RPCs indefinitely (even after it has responded to the client) until all followers eventually store all log entries.
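A sketch of that fan-out, assuming the Server type from before also carries a log []Entry; the Entry type and sendAppendEntries helper are stand-ins, not real APIs:

```go
type Entry struct {
	Term    int
	Command []byte
}

// sendAppendEntries stands in for the real RPC; assume it returns true
// once the follower has durably stored the entry.
func (s *Server) sendAppendEntries(peer string, e Entry) bool {
	return true
}

// replicate appends a client command to the leader's log and pushes it to
// every peer in parallel, retrying each peer until it succeeds.
func (s *Server) replicate(e Entry, peers []string) {
	s.log = append(s.log, e)
	for _, p := range peers {
		go func(p string) {
			for !s.sendAppendEntries(p, e) {
				time.Sleep(50 * time.Millisecond) // crash/slow/lost packet: retry
			}
		}(p)
	}
}
```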
The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers.
The leader keeps track of the highest index it knows to be committed, and it includes that index in future AppendEntries RPCs (including heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it applies the entry to its local state machine.
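On the follower side, that might reduce to the following (commitIndex and lastApplied assumed as fields on Server, apply as the hand-off to the local state machine):

```go
// advanceCommit is called with the leaderCommit value carried on every
// AppendEntries RPC (heartbeats included).
func (s *Server) advanceCommit(leaderCommit int) {
	if leaderCommit > s.commitIndex {
		s.commitIndex = leaderCommit
		if last := len(s.log) - 1; last < s.commitIndex {
			s.commitIndex = last // can't commit past what we have stored
		}
	}
	for s.lastApplied < s.commitIndex {
		s.lastApplied++
		s.apply(s.log[s.lastApplied]) // execute on the local state machine
	}
}
```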
In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point.
When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its log. If a follower's log is inconsistent with the leader's, the AppendEntries consistency check will fail in the next AppendEntries RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries RPC. Eventually nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries will succeed, which removes any conflicting entries in the follower's log and appends entries from the leader's log (if any). Once AppendEntries succeeds, the follower's log is consistent with the leader's, and it will remain that way for the rest of the term. A leader never overwrites or deletes entries in its own log.
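A sketch of that backoff loop for a single follower, with nextIndex as a per-peer map on the leader and sendEntriesFrom as a hypothetical wrapper:

```go
// sendEntriesFrom stands in for an AppendEntries RPC carrying all entries
// from index next onward, plus the index and term of the entry before it
// (which is what the consistency check verifies).
func (s *Server) sendEntriesFrom(peer string, next int) bool {
	return true
}

// syncFollower walks nextIndex backwards until the consistency check passes;
// at that point the follower drops its conflicting suffix and appends the
// leader's entries, so the logs match.
func (s *Server) syncFollower(peer string) {
	next := s.nextIndex[peer]
	for !s.sendEntriesFrom(peer, next) {
		next-- // rejected: back up one entry and retry
	}
	s.nextIndex[peer] = len(s.log)
}
```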
Raft can elect and maintain a steady leader only as long as the system satisfies the timing requirement

broadcastTime ≪ electionTimeout ≪ MTBF

where broadcastTime is the average time it takes a server to send RPCs in parallel to every other server and receive their responses, electionTimeout is the interval described above, and MTBF is the average time between failures for a single server. Broadcast times are typically far below reasonable election timeouts (roughly milliseconds versus 10–500 ms), and server MTBFs are months or more, so the requirement is easy to satisfy.
In order to avoid availability gaps, new servers join the cluster as non-voting members (the leader replicates log entries to them, but they are not considered for majorities). Once the new servers have caught up with the rest of the cluster, the configuration change can proceed and they become full voting members.
Storing a small amount of metadata with each snapshot, namely the index and term of the last entry the snapshot replaces (the last included index and last included term), allows the AppendEntries consistency check to work for the first log entry following the snapshot. The leader must occasionally send snapshots to followers that lag behind.
An alternative would be for only the leader to create snapshots and then send them to each of its followers. However, this has two disadvantages: sending the snapshot to each follower wastes network bandwidth (each follower already has the information needed to produce its own snapshots), and the leader's implementation becomes more complex.
One simple strategy is to take a snapshot when the log reaches a fixed size in bytes. If this size is set to be significantly larger than the expected size of a snapshot, then the disk bandwidth overhead for snapshotting will be small.
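Putting these paragraphs together, a simple snapshot policy might look like the following (the serialization and persistence calls are hypothetical, and real implementations keep an index offset rather than re-slicing from zero):

```go
type Snapshot struct {
	LastIncludedIndex int // index of the last entry the snapshot replaces
	LastIncludedTerm  int // its term, for the AppendEntries consistency check
	Data              []byte
}

func (s *Server) maybeSnapshot(logSizeBytes, thresholdBytes int) {
	if logSizeBytes < thresholdBytes {
		return // keep snapshotting overhead small relative to disk bandwidth
	}
	snap := Snapshot{
		LastIncludedIndex: s.lastApplied,
		LastIncludedTerm:  s.log[s.lastApplied].Term,
		Data:              s.serializeStateMachine(), // hypothetical
	}
	s.persist(snap)                 // hypothetical durable write
	s.log = s.log[s.lastApplied+1:] // discard the compacted prefix
}
```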
In any leader-based consensus algorithm, the leader must eventually store all of the committed log entries. Raft uses a simpler approach: it guarantees that all the committed entries from previous terms are present on each new leader from the moment of its election, without the need to transfer those entries to the leader. This means that log entries only flow in one direction, from leaders to followers, and leaders never overwrite existing entries in their logs. The RequestVote RPC implements this restriction: the RPC includes information about the candidate's log, and the voter denies its vote if its own log is more up-to-date than that of the candidate.
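The "more up-to-date" comparison itself is mechanical: compare the terms of the two logs' last entries first, and break ties on log length. A small sketch:

```go
// moreUpToDate reports whether log A (described by its last entry's term
// and index) is more up-to-date than log B, per Raft's election restriction.
func moreUpToDate(lastTermA, lastIndexA, lastTermB, lastIndexB int) bool {
	if lastTermA != lastTermB {
		return lastTermA > lastTermB // later term wins
	}
	return lastIndexA > lastIndexB // same term: longer log wins
}
```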
If a follower or candidate crashes, then future RequestVote and AppendEntries RPCs sent to it will fail. Raft handles these failures by retrying indefinitely; if the crashed server restarts, then the RPC will complete successfully. If a server crashes after completing an RPC but before responding, then it will receive the same RPC again after it restarts. Raft RPCs are idempotent, so this causes no harm.
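In code, that failure handling can be as blunt as an unbounded retry loop, precisely because redelivery is harmless:

```go
// callWithRetry retries an RPC until it succeeds; this is safe because
// Raft RPCs are idempotent, so a duplicate after a crash-restart is a no-op.
func callWithRetry(rpc func() error) {
	for rpc() != nil {
		time.Sleep(100 * time.Millisecond) // arbitrary pause between attempts
	}
}
```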
Clients of Raft send all of their requests to the leader. When a client first starts up, it connects to a randomly-chosen server. If the client's first choice is not the leader, that server will reject the client's request and supply information about the most recent leader it has heard from (AppendEntries requests include the network address of the leader). If the leader crashes, client requests will time out; clients then try again with randomly-chosen servers.
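A client-side sketch of that discovery dance, where tryServer is a hypothetical call returning either a response, a leader hint from a rejecting follower, or an error on timeout:

```go
import "math/rand"

// tryServer stands in for the real request/response RPC.
func tryServer(addr string, req []byte) (resp []byte, leaderHint string, err error) {
	return nil, "", nil
}

func sendToLeader(servers []string, req []byte) []byte {
	target := servers[rand.Intn(len(servers))]
	for {
		resp, leaderHint, err := tryServer(target, req)
		switch {
		case err != nil: // timeout, e.g. the leader crashed: pick another server
			target = servers[rand.Intn(len(servers))]
		case leaderHint != "": // a follower rejected us and named the leader
			target = leaderHint
		default:
			return resp
		}
	}
}
```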