Raft is a consensus algorithm for managing a replicated log. It decomposes the consensus problem into three independent subproblems as follows.
Raft divides time into terms of arbitrary length, and each term begins with an election, in which one or more server attempt to become the leader. At any given time each server is in one of three states: leader, follower, or candidate.
In normal operation there is exactly one leader and all of the other servers are followers. Followers are passive: they issue no requests on their own but simply respond to requests from leaders and candidates. The leader handles all client requests (if a client contacts a follower, the follower redirects it to the leader). The third state, candidate, is used to elect a new leader
If a candidate wins the election, then it serves as leader for the rest of the term. In some situations an election will result in a split vote. In this case the term will end with no leader; a new term (with a new election) will begin shortly. Raft ensures that there is at most one leader in a given term.
Communication between server happen using remote procedure calls (RPCs), and it requires only two types of RPCs for driving consensus
The first step is to elect a distinguished leader, and then give the complete responsibility for managing the replicated log. The leader accepts log entries from clients, replicates them on other servers, and tells servers when it is safe to apply log entries to their state machines.
Each server stores a current term number, which increases monotonically over time. Current terms are exchanged whenever servers communicate; if one server’s current term is smaller than the other’s, then it updates its current term to the larger value. If a candidate or leader discovers that its term is out of date, it immediately reverts to follower state. If a server receives a request with a stale term number, it rejects the request.
Raft uses a heartbeat mechanism to trigger leader election. When servers start up, they begin as followers. A server
remains in follower state as
long as it receives valid RPCs from a leader or candidate. Leaders send periodic heartbeats
(AppendEntries
RPCs that
carry no log entries) to all
followers in order to maintain their authority. If a follower receives no communication over a period of time called
the election timeout, then it
assumes there is no viable leader and begins an election to choose a new leader. a follower increments its current
term and transitions to
candidate state. - electing the leader which has the recent version of the log(key difference)
A candidate wins an election if it receives votes from a majority of the servers in the full cluster for the same term. Each server will vote for at most one candidate in a given term, on a first-come-first-served basis
While waiting for votes, a candidate may receive an AppendEntries
RPC from another server claiming to
be leader. If
the leader’s term (included in
its RPC) is at least as large as the candidate’s current term, then the candidate recognizes the leader as
legitimate and returns to follower
state. If the term in the RPC is smaller than the candidate’s current term, then the candidate rejects the RPC and
continues in candidate state.
The third possible outcome is that a candidate neither wins nor loses the election: if many followers become
candidates at the same time, votes
could be split so that no candidate obtains a majority. When this happens, each candidate will time out and start a
new election by incrementing
its term and initiating another round of
RequestVote
RPCs. Raft uses randomized election timeouts to ensure that split votes are rare and that
they are resolved quickly.
The leader appends the command to its log as a new entry, then issues AppendEntries
RPCs in parallel to
each of the other servers to
replicate the entry. When the entry has been safely replicated, the leader applies the entry to its state machine
and returns the result of that
execution to the client. If followers crash or run slowly, or if network packets are lost, the leader retries
AppendEntries
RPCs indefinitely (even after it has responded to the client) until all followers
eventually store all log entries.
The leader decides when it is safe to apply a log entry to the state machines; such an entry is called committed. Raft guarantees that committed entries are durable and will eventually be executed by all of the available state machines. A log entry is committed once the leader that created the entry has replicated it on a majority of the servers.
The leader keeps track of the highest index it knows to be committed, and it includes that index in future
AppendEntries
RPCs (including
heartbeats) so that the other servers eventually find out. Once a follower learns that a log entry is committed, it
applies the entry to its local
state machine
In Raft, the leader handles inconsistencies by forcing the followers’ logs to duplicate its own. This means that conflicting entries in follower logs will be overwritten with entries from the leader’s log. To bring a follower’s log into consistency with its own, the leader must find the latest log entry where the two logs agree, delete any entries in the follower’s log after that point, and send the follower all of the leader’s entries after that point.
When a leader first comes to power, it initializes all nextIndex values to the index just after the last one in its
log. If a follower’s log is
inconsistent with the leader’s, the AppendEntries
consistency check will fail in the next AppendEntries
RPC. After a rejection, the leader decrements nextIndex and retries the AppendEntries
RPC. Eventually
nextIndex will reach a point where the leader and follower logs match. When this happens, AppendEntries
will
succeed, which removes any conflicting entries in the follower’s log and appends entries from the leader’s log
(if any). Once AppendEntries
succeeds, the follower’s log is consistent with the leader’s, and it will
remain that way for the rest of the term. A leader never overwrites or deletes entries in its own log
broadcastTime ≪ electionTimeout ≪ MTBF
In order to avoid availability gaps, new servers join the cluster as non-voting members (the leader replicates log entries to them, but they are not considered for majorities). Once the new servers have caught up with the rest of the cluster, it can start to act as leader.
Creating a Snapshot with metadata like the last included index and last entry in log can help in doing the consistency checks in the AppendEntries. The leader must occasionally send snapshots to followers that lag behind.
the leader would create a snapshot, then it would send this snapshot to each of its followers. However, this has two disadvantages.
One simple strategy is to take a snapshot when the log reaches a fixed size in bytes. If this size is set to be significantly larger than the expected size of a snapshot, then the disk bandwidth overhead for snapshotting will be small.
In any leader-based consensus algorithm, the leader must eventually store all of the committed log entries. Raft
uses a simpler approach where it
guarantees that all the committed entries from previous terms are present on each new leader from the moment of its
election, without the need to
transfer those entries to the leader. This means that log entries only flow in one direction, from leaders to
followers, and leaders never overwrite existing entries in their logs. RequestVote
RPC implements this
restriction: the RPC includes information
about the candidate’s log
If a follower or candidate crashes, then future RequestVote
and AppendEntries
RPCs sent to
it will fail. Raft
handles these failures by retrying
indefinitely; if the crashed server restarts, then the RPC will complete successfully. If a server crashes after
completing an RPC but before
responding, then it will receive the same RPC again after it restarts. Raft RPCs are idempotent, so this causes no
harm.
Clients of Raft send all of their requests to the leader. When a client first starts up, it connects to a randomly-
chosen server. If the client’s
first choice is not the leader, that server will reject the client’s request and supply information about the most
recent leader it has heard from
(AppendEntries
requests include the network address of the leader). If the leader crashes, client
requests will time
out; clients then try again
with randomly-chosen servers.
For my new posts, Subscribe via weekly RSS, common RSS