Consensus Algorithms: Getting Distributed Nodes to Agree
How Paxos and Raft solve the consensus problem in distributed systems, enabling multiple nodes to agree on a single value even when some nodes fail.
Terminology
- Consensus: the process by which a group of distributed nodes agrees on a single value or decision, even when some nodes fail or messages are lost
- Leader: a designated node that coordinates proposals and commits in a consensus protocol; other nodes (followers) accept the leader's decisions
- Follower: a node that replicates the leader's log and responds to the leader's requests; it does not initiate proposals on its own
- Candidate: in Raft, a node that is attempting to become the new leader by requesting votes from other nodes
- Term: a logical time period in Raft; each term begins with an election and has at most one leader; terms are monotonically increasing integers
- Proposal: a suggested value submitted by a proposer in Paxos; it carries a unique proposal number that determines priority
- Quorum: a majority of nodes ($\lfloor n/2 \rfloor + 1$ out of $n$) that must agree for a decision to be committed
- Log: an ordered sequence of entries (commands or values) that each node maintains; consensus ensures all nodes eventually have the same log
- Commit: the point at which a log entry is considered durable and safe to apply; an entry is committed once a quorum of nodes has replicated it
- Split vote: a situation where no candidate receives a majority of votes in an election, requiring a new election round
- Heartbeat: a periodic message sent by the leader to followers to assert its authority and prevent unnecessary elections
- Proposer: in Paxos, the node that initiates a proposal by sending prepare requests to acceptors
- Acceptor: in Paxos, a node that votes on proposals; it promises not to accept proposals with lower numbers than the highest it has seen
- Learner: in Paxos, a node that learns the chosen value once a quorum of acceptors has accepted a proposal
- Safety: the guarantee that the system never reaches an incorrect state; in consensus, at most one value is chosen
- Liveness: the guarantee that the system eventually makes progress; in consensus, some proposed value is eventually chosen
What & Why
In a distributed system, multiple nodes need to agree on things: which node is the leader, what order to apply transactions, whether to commit or abort. This is the consensus problem. It sounds simple, but it is one of the hardest problems in distributed computing because nodes can crash, messages can be delayed or lost, and there is no global clock.
The FLP impossibility result (Fischer, Lynch, Paterson, 1985) proved that no deterministic consensus algorithm can guarantee progress in an asynchronous system where even one node can crash. This means every practical consensus algorithm must make trade-offs, typically by using timeouts to detect failures (introducing a partial synchrony assumption).
Paxos, first described by Leslie Lamport in 1989 (and eventually published in 1998 as "The Part-Time Parliament"), was the first practical consensus algorithm. It is provably correct but notoriously difficult to understand and implement. Raft, designed by Diego Ongaro and John Ousterhout and published in 2014, solves the same problem with an emphasis on understandability. Raft decomposes consensus into leader election, log replication, and safety, making each piece easier to reason about independently.
These algorithms are the foundation of systems like etcd (used by Kubernetes), ZooKeeper, CockroachDB, and TiKV. Any time you need multiple machines to agree on something reliably, you need consensus.
How It Works
Paxos: The Two-Phase Approach
Paxos operates in two phases. In Phase 1 (Prepare), a proposer selects a proposal number n and sends a Prepare(n) message to a quorum of acceptors. Each acceptor responds with a promise not to accept any proposal numbered less than n, and includes the highest-numbered proposal it has already accepted (if any).
In Phase 2 (Accept), if the proposer receives promises from a quorum, it sends an Accept(n, v) message where v is either the value from the highest-numbered previously accepted proposal, or the proposer's own value if no acceptor had accepted anything. If a quorum of acceptors accepts this proposal, the value is chosen.
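The Phase 2 value-selection rule is the heart of Paxos's safety argument, and it fits in a few lines. The sketch below is illustrative (the function and variable names are ours, not from any library); `promises` stands for the Phase 1 responses from a quorum:

```python
# Sketch of the Paxos proposer's Phase 2 value-selection rule.
# `promises` is a list of (accepted_num, accepted_value) pairs from the
# acceptors' Phase 1 responses; accepted_num is None if that acceptor
# had not yet accepted any proposal.

def choose_phase2_value(promises, own_value):
    """Return the value v the proposer must send in Accept(n, v)."""
    best_num = -1
    best_value = None
    for accepted_num, accepted_value in promises:
        if accepted_num is not None and accepted_num > best_num:
            best_num = accepted_num
            best_value = accepted_value
    # If any acceptor already accepted a proposal, we are bound to the
    # value from the highest-numbered one; otherwise we may propose our own.
    return best_value if best_num >= 0 else own_value
```

For example, if one acceptor reports it already accepted proposal 3 with value "x", the proposer must re-propose "x" rather than its own value; this is what prevents a later proposal from overwriting a possibly chosen value.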
Raft: Leader-Based Consensus
Raft divides consensus into three sub-problems:
Leader Election: All nodes start as followers. If a follower does not hear from a leader within a randomized timeout (typically 150-300ms), it becomes a candidate, increments its term, votes for itself, and requests votes from other nodes. If it receives votes from a majority, it becomes the leader. If another node wins or a higher term is discovered, the candidate reverts to follower. Randomized timeouts prevent repeated split votes.
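The randomized timeout that drives elections is trivial to express; the helper below is an illustrative sketch (the name and default range are ours, with the 150-300 ms range mirroring the figure quoted above):

```python
import random

# Illustrative randomized election timeout. Each follower draws a fresh
# timeout whenever it hears from the leader; the randomization makes it
# unlikely that two followers time out simultaneously and split the vote.

def election_timeout_ms(base_ms=150.0, spread_ms=150.0):
    """Draw a timeout uniformly from [base_ms, base_ms + spread_ms]."""
    return base_ms + random.uniform(0.0, spread_ms)
```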
Log Replication: The leader receives client requests, appends them to its log, and sends AppendEntries RPCs to all followers. Once a majority of followers have replicated an entry, the leader commits it and responds to the client. Followers apply committed entries to their state machines in log order.
Safety: Raft guarantees that if a log entry is committed, it will be present in the logs of all future leaders. This is enforced by the election restriction: each node grants its vote only if the candidate's log is at least as up-to-date as its own, so a candidate can win only if its log is at least as up-to-date as those of a majority of the cluster. "Up-to-date" means the candidate's last log entry has a higher term, or the same term with an equal or higher index.
Leader Election Timeline
A typical Raft election proceeds as follows: the leader crashes and stops sending heartbeats, a follower's election timeout fires, the follower becomes a candidate for term T+1, sends RequestVote RPCs to its peers, receives a majority of votes, and begins sending heartbeats as the new leader. The entire process typically completes in well under a second.
Complexity Analysis
Let $n$ be the number of nodes in the cluster.
| Operation | Messages | Rounds | Notes |
|---|---|---|---|
| Paxos Prepare | $O(n)$ | 1 round-trip | Proposer to quorum |
| Paxos Accept | $O(n)$ | 1 round-trip | Proposer to quorum |
| Paxos Total (single value) | $O(n)$ | 2 round-trips | Can pipeline with Multi-Paxos |
| Raft Election | $O(n)$ | 1 round-trip | RequestVote to all peers |
| Raft Log Replication | $O(n)$ | 1 round-trip | AppendEntries to all followers |
| Raft Heartbeat | $O(n)$ per interval | Continuous | Empty AppendEntries |
Both Paxos and Raft require a quorum of $\lfloor n/2 \rfloor + 1$ nodes to make progress. The system tolerates up to $f$ failures where:
$$f = \left\lfloor \frac{n-1}{2} \right\rfloor$$
For a 5-node cluster: $f = 2$ (tolerates 2 failures). For a 3-node cluster: $f = 1$.
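These two quantities can be checked with a pair of one-liners (helper names are ours):

```python
def quorum(n):
    """Smallest majority of an n-node cluster."""
    return n // 2 + 1

def max_failures(n):
    """Largest f such that n - f nodes still form a quorum."""
    return (n - 1) // 2
```

Note that a 4-node cluster tolerates the same single failure as a 3-node cluster (its quorum is 3 of 4), which is why production clusters almost always use odd sizes.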
With randomized election timeouts drawn from a range $[T, 2T]$, two candidates split the vote only when their timeouts fire within roughly one message round-trip of each other, which becomes increasingly unlikely as $T$ grows relative to network latency. In practice, split votes are rare with well-chosen timeout ranges, and elections complete in a single round the vast majority of the time.
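A toy Monte Carlo model makes this concrete. This is an illustration under simplifying assumptions (two contending followers, a split counted whenever their timeouts fire within one round-trip of each other), not the analysis from the Raft paper:

```python
import random

# Toy model of split-vote frequency under randomized timeouts.
# Assumption: a split vote happens iff the two timeouts fire within
# one network round-trip (rtt) of each other, so neither candidate can
# collect a majority before the other also stands for election.

def split_vote_rate(T=150.0, rtt=5.0, trials=100_000, seed=42):
    rng = random.Random(seed)
    splits = 0
    for _ in range(trials):
        t1 = rng.uniform(T, 2 * T)  # first follower's timeout
        t2 = rng.uniform(T, 2 * T)  # second follower's timeout
        if abs(t1 - t2) < rtt:
            splits += 1
    return splits / trials
```

With the defaults (a 150 ms spread against a 5 ms round-trip) the modeled split rate comes out in the low single-digit percent range, consistent with the claim that most elections finish in one round.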
Implementation
ALGORITHM PaxosPrepare(proposer, acceptors, proposalNum)
INPUT: proposer: this node, acceptors: list of acceptor nodes, proposalNum: unique proposal number
OUTPUT: (promises, highestAccepted) or failure
BEGIN
    promises <- 0
    highestAccepted <- NIL
    highestAcceptedNum <- -1
    quorum <- FLOOR(LENGTH(acceptors) / 2) + 1
    FOR EACH acceptor IN acceptors DO
        response <- SEND_PREPARE(acceptor, proposalNum)
        IF response.promised THEN
            promises <- promises + 1
            IF response.acceptedNum > highestAcceptedNum THEN
                highestAcceptedNum <- response.acceptedNum
                highestAccepted <- response.acceptedValue
            END IF
        END IF
    END FOR
    IF promises >= quorum THEN
        RETURN (promises, highestAccepted)
    ELSE
        RETURN failure
    END IF
END
ALGORITHM PaxosAccept(proposer, acceptors, proposalNum, value)
INPUT: proposer: this node, acceptors: list of acceptors, proposalNum: proposal number, value: proposed value
OUTPUT: success or failure
BEGIN
    accepts <- 0
    quorum <- FLOOR(LENGTH(acceptors) / 2) + 1
    FOR EACH acceptor IN acceptors DO
        response <- SEND_ACCEPT(acceptor, proposalNum, value)
        IF response.accepted THEN
            accepts <- accepts + 1
        END IF
    END FOR
    IF accepts >= quorum THEN
        NOTIFY_LEARNERS(value)
        RETURN success
    ELSE
        RETURN failure
    END IF
END
ALGORITHM AcceptorHandlePrepare(acceptor, proposalNum)
INPUT: acceptor: this node's state, proposalNum: received proposal number
OUTPUT: promise response
BEGIN
    IF proposalNum > acceptor.highestPromised THEN
        acceptor.highestPromised <- proposalNum
        RETURN {
            promised: TRUE,
            acceptedNum: acceptor.acceptedProposalNum,
            acceptedValue: acceptor.acceptedValue
        }
    ELSE
        RETURN { promised: FALSE }
    END IF
END
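The acceptor side of Paxos translates almost directly into running code. The class below is a minimal in-memory sketch of the pseudocode above, with the Accept handler added for completeness; real acceptors must also persist their state to stable storage before replying, and the transport layer is omitted entirely:

```python
class Acceptor:
    """In-memory Paxos acceptor. State would need to be written to disk
    before each reply in a real deployment."""

    def __init__(self):
        self.highest_promised = -1   # highest proposal number promised
        self.accepted_num = None     # number of the accepted proposal, if any
        self.accepted_value = None   # value of the accepted proposal, if any

    def handle_prepare(self, proposal_num):
        # Promise only if this proposal outranks everything seen so far.
        if proposal_num > self.highest_promised:
            self.highest_promised = proposal_num
            return {"promised": True,
                    "accepted_num": self.accepted_num,
                    "accepted_value": self.accepted_value}
        return {"promised": False}

    def handle_accept(self, proposal_num, value):
        # Accept unless a higher-numbered proposal was promised meanwhile.
        if proposal_num >= self.highest_promised:
            self.highest_promised = proposal_num
            self.accepted_num = proposal_num
            self.accepted_value = value
            return {"accepted": True}
        return {"accepted": False}
```

A quick walk-through: after `handle_prepare(1)` and `handle_accept(1, "v")`, a later `handle_prepare(2)` succeeds and reports the accepted pair `(1, "v")`, which is exactly the information the new proposer needs for its Phase 2 value choice.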
ALGORITHM RaftLeaderElection(node, peers, currentTerm)
INPUT: node: this node, peers: list of peer nodes, currentTerm: current term number
OUTPUT: new role (leader or follower)
BEGIN
    node.role <- CANDIDATE
    node.currentTerm <- currentTerm + 1
    node.votedFor <- node.id
    votes <- 1 // vote for self
    quorum <- FLOOR((LENGTH(peers) + 1) / 2) + 1
    FOR EACH peer IN peers DO
        response <- SEND_REQUEST_VOTE(peer, node.currentTerm, node.lastLogIndex, node.lastLogTerm)
        IF response.voteGranted THEN
            votes <- votes + 1
        END IF
        IF response.term > node.currentTerm THEN
            node.currentTerm <- response.term
            node.role <- FOLLOWER
            RETURN FOLLOWER
        END IF
    END FOR
    IF votes >= quorum THEN
        node.role <- LEADER
        SEND_HEARTBEATS(peers, node.currentTerm)
        RETURN LEADER
    ELSE
        node.role <- FOLLOWER
        RETURN FOLLOWER
    END IF
END
ALGORITHM RaftAppendEntries(leader, followers, entry)
INPUT: leader: leader node, followers: list of follower nodes, entry: log entry to replicate
OUTPUT: committed or not
BEGIN
    APPEND entry TO leader.log
    ackCount <- 1 // leader counts itself
    quorum <- FLOOR((LENGTH(followers) + 1) / 2) + 1
    FOR EACH follower IN followers DO
        response <- SEND_APPEND_ENTRIES(follower, leader.currentTerm, entry, leader.commitIndex)
        IF response.success THEN
            ackCount <- ackCount + 1
        ELSE IF response.term > leader.currentTerm THEN
            leader.role <- FOLLOWER
            leader.currentTerm <- response.term
            RETURN not committed
        END IF
    END FOR
    IF ackCount >= quorum THEN
        leader.commitIndex <- leader.commitIndex + 1
        RETURN committed
    ELSE
        RETURN not committed
    END IF
END
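The leader's commit decision can be boiled down to counting acknowledgments against the quorum. The sketch below models followers as plain in-memory lists and assumes every reachable follower accepts the entry; real Raft also carries prevLogIndex/prevLogTerm consistency checks, retries, and term comparisons that are omitted here:

```python
# Toy model of the leader-side commit decision from RaftAppendEntries.
# `follower_logs` contains only the followers that are currently reachable;
# unreachable nodes simply contribute no ack.

def replicate(leader_log, follower_logs, entry, cluster_size):
    """Append `entry` on the leader, copy it to each reachable follower,
    and report whether a quorum of the whole cluster now holds it."""
    leader_log.append(entry)
    acks = 1  # the leader counts itself
    for log in follower_logs:
        log.append(entry)  # assumption: every reachable follower accepts
        acks += 1
    return acks >= cluster_size // 2 + 1
```

In a 5-node cluster, reaching two followers yields three copies and the entry commits; reaching only one follower leaves two copies, short of the quorum of three, so the entry stays uncommitted until more followers catch up.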
Real-World Applications
- etcd and Kubernetes: etcd uses Raft to maintain a consistent key-value store that Kubernetes relies on for cluster state, service discovery, and configuration management
- Apache ZooKeeper: uses a Paxos-derived protocol called ZAB (ZooKeeper Atomic Broadcast) to coordinate distributed applications, manage leader election, and maintain configuration
- CockroachDB and TiKV: these distributed databases use Raft to replicate data across nodes, ensuring that committed transactions survive node failures
- Google Chubby: a distributed lock service built on Paxos that provides coarse-grained locking and reliable storage for small files, used internally at Google for leader election and configuration
- Consul: HashiCorp's service mesh and configuration tool uses Raft for consistent storage of service catalog data and health check results
- Blockchain: while not using Paxos or Raft directly, blockchain systems solve a variant of consensus (Byzantine fault tolerance) where nodes may be actively malicious, not just faulty
Key Takeaways
- Consensus allows distributed nodes to agree on a single value even when some nodes fail; it is the foundation of reliable distributed systems
- Paxos uses a two-phase protocol (Prepare and Accept) with proposal numbers to ensure safety; it is provably correct but complex to implement
- Raft decomposes consensus into leader election, log replication, and safety, making it easier to understand and implement than Paxos
- Both algorithms require a quorum of $\lfloor n/2 \rfloor + 1$ nodes and tolerate up to $\lfloor (n-1)/2 \rfloor$ failures
- Raft uses randomized election timeouts to avoid split votes, and heartbeats to maintain leader authority
- The FLP impossibility result means no consensus algorithm can guarantee liveness in a purely asynchronous system; practical systems use timeouts to work around this