
Consensus Algorithms: Getting Distributed Nodes to Agree

How Paxos and Raft solve the consensus problem in distributed systems, enabling multiple nodes to agree on a single value even when some nodes fail.

2021-02-20
Tags: Distributed Systems · consensus · paxos · raft

Terminology

  • Consensus: the process by which a group of distributed nodes agrees on a single value or decision, even when some nodes fail or messages are lost
  • Leader: a designated node that coordinates proposals and commits in a consensus protocol; other nodes (followers) accept the leader's decisions
  • Follower: a node that replicates the leader's log and responds to the leader's requests; it does not initiate proposals on its own
  • Candidate: in Raft, a node that is attempting to become the new leader by requesting votes from other nodes
  • Term: a logical time period in Raft; each term begins with an election and has at most one leader; terms are monotonically increasing integers
  • Proposal: a suggested value submitted by a proposer in Paxos; it carries a unique proposal number that determines priority
  • Quorum: a majority of nodes ($\lfloor n/2 \rfloor + 1$ out of $n$) that must agree for a decision to be committed
  • Log: an ordered sequence of entries (commands or values) that each node maintains; consensus ensures all nodes eventually have the same log
  • Commit: the point at which a log entry is considered durable and safe to apply; an entry is committed once a quorum of nodes has replicated it
  • Split vote: a situation where no candidate receives a majority of votes in an election, requiring a new election round
  • Heartbeat: a periodic message sent by the leader to followers to assert its authority and prevent unnecessary elections
  • Proposer: in Paxos, the node that initiates a proposal by sending prepare requests to acceptors
  • Acceptor: in Paxos, a node that votes on proposals; it promises not to accept proposals with lower numbers than the highest it has seen
  • Learner: in Paxos, a node that learns the chosen value once a quorum of acceptors has accepted a proposal
  • Safety: the guarantee that the system never reaches an incorrect state; in consensus, at most one value is chosen
  • Liveness: the guarantee that the system eventually makes progress; in consensus, some proposed value is eventually chosen

What & Why

In a distributed system, multiple nodes need to agree on things: which node is the leader, what order to apply transactions, whether to commit or abort. This is the consensus problem. It sounds simple, but it is one of the hardest problems in distributed computing because nodes can crash, messages can be delayed or lost, and there is no global clock.

The FLP impossibility result (Fischer, Lynch, Paterson, 1985) proved that no deterministic consensus algorithm can guarantee progress in an asynchronous system where even one node can crash. This means every practical consensus algorithm must make trade-offs, typically by using timeouts to detect failures (introducing a partial synchrony assumption).

Paxos, first described by Leslie Lamport in 1989 (and formally published in 1998 as "The Part-Time Parliament"), is the classic consensus algorithm. It is provably correct but notoriously difficult to understand and implement. Raft, designed by Diego Ongaro and John Ousterhout and published in 2014, solves the same problem with an emphasis on understandability. Raft decomposes consensus into leader election, log replication, and safety, making each piece easier to reason about independently.

These algorithms are the foundation of systems like etcd (used by Kubernetes), ZooKeeper, CockroachDB, and TiKV. Any time you need multiple machines to agree on something reliably, you need consensus.

How It Works

Paxos: The Two-Phase Approach

Paxos operates in two phases. In Phase 1 (Prepare), a proposer selects a proposal number n and sends a Prepare(n) message to a quorum of acceptors. Each acceptor responds with a promise not to accept any proposal numbered less than n, and includes the highest-numbered proposal it has already accepted (if any).

In Phase 2 (Accept), if the proposer receives promises from a quorum, it sends an Accept(n, v) message where v is either the value from the highest-numbered previously accepted proposal, or the proposer's own value if no acceptor had accepted anything. If a quorum of acceptors accepts this proposal, the value is chosen.
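As a concrete illustration, the two phases can be simulated in Python. This is a minimal single-decree sketch with in-memory acceptors; the `Acceptor` and `propose` names are illustrative, and a real deployment adds networking, retries, and stable storage:

```python
class Acceptor:
    """Tracks the highest proposal number promised and the last accepted proposal."""
    def __init__(self):
        self.promised = -1       # highest proposal number promised so far
        self.accepted_num = -1   # number of the last accepted proposal
        self.accepted_val = None

    def prepare(self, n):
        # Phase 1: promise not to accept any proposal numbered below n,
        # and report the highest-numbered proposal already accepted (if any).
        if n > self.promised:
            self.promised = n
            return (True, self.accepted_num, self.accepted_val)
        return (False, None, None)

    def accept(self, n, value):
        # Phase 2: accept unless a higher-numbered proposal was promised.
        if n >= self.promised:
            self.promised = n
            self.accepted_num = n
            self.accepted_val = value
            return True
        return False

def propose(acceptors, n, value):
    """Run both phases against a list of acceptors; return the chosen value or None."""
    quorum = len(acceptors) // 2 + 1
    # Phase 1: Prepare.
    granted = [p for p in (a.prepare(n) for a in acceptors) if p[0]]
    if len(granted) < quorum:
        return None
    # Adopt the value of the highest-numbered previously accepted proposal, if any;
    # otherwise the proposer is free to use its own value.
    prior = max((p for p in granted if p[2] is not None),
                key=lambda p: p[1], default=None)
    chosen = prior[2] if prior else value
    # Phase 2: Accept.
    accepts = sum(a.accept(n, chosen) for a in acceptors)
    return chosen if accepts >= quorum else None

acceptors = [Acceptor() for _ in range(5)]
print(propose(acceptors, 1, "blue"))   # -> blue
print(propose(acceptors, 2, "green"))  # -> blue (the previously accepted value wins)
```

The second call is the interesting one: even though the new proposer wants "green", Phase 1 forces it to adopt the already-accepted "blue", which is exactly how Paxos preserves safety.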

Raft: Leader-Based Consensus

[Figure: Raft node state transitions. A follower that times out starts an election and becomes a candidate; a candidate that wins a majority becomes leader, holds a new election on a split vote, and reverts to follower if it discovers the current leader or a higher term. Followers replicate the log from the leader, candidates request votes from peers, and the leader sends heartbeats and replicates entries.]

Raft divides consensus into three sub-problems:

Leader Election: All nodes start as followers. If a follower does not hear from a leader within a randomized timeout (typically 150-300ms), it becomes a candidate, increments its term, votes for itself, and requests votes from other nodes. If it receives votes from a majority, it becomes the leader. If another node wins or a higher term is discovered, the candidate reverts to follower. Randomized timeouts prevent repeated split votes.
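The follower-to-candidate transition can be sketched with a toy `Node` class (illustrative names, not a real implementation; the 150-300 ms range matches the typical timeouts mentioned above):

```python
import random

class Node:
    def __init__(self, node_id):
        self.id = node_id
        self.role = "follower"
        self.term = 0
        self.voted_for = None

    def election_timeout(self):
        # A fresh random timeout is drawn for each election; the spread is what
        # makes it unlikely that two followers time out at the same moment.
        return random.uniform(150, 300)  # milliseconds

    def on_timeout(self):
        # No heartbeat arrived in time: start an election for the next term,
        # vote for ourselves, and ask the other nodes for their votes.
        self.role = "candidate"
        self.term += 1
        self.voted_for = self.id
        return {"type": "RequestVote", "term": self.term, "candidate": self.id}

n = Node("A")
msg = n.on_timeout()
print(n.role, n.term, msg["type"])  # candidate 1 RequestVote
```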

Log Replication: The leader receives client requests, appends them to its log, and sends AppendEntries RPCs to all followers. Once a majority of followers have replicated an entry, the leader commits it and responds to the client. Followers apply committed entries to their state machines in log order.
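The leader's commit rule boils down to a quorum count over follower progress. A simplified sketch (real Raft additionally only commits entries from the leader's current term directly; `match_index` is the usual name for the leader's per-follower replication state):

```python
def commit_if_replicated(match_index, leader_last_index, cluster_size):
    """Return the highest log index stored on a quorum of servers, or None.

    match_index: for each follower, the highest log index known to be
    replicated there. The leader counts itself toward the quorum.
    """
    quorum = cluster_size // 2 + 1
    for idx in range(leader_last_index, 0, -1):
        replicas = 1 + sum(1 for m in match_index if m >= idx)
        if replicas >= quorum:
            return idx
    return None

# 5-node cluster: leader at index 7, followers at 7, 6, 3, 2.
# Index 7 lives on only 2 servers; index 6 lives on 3, so 6 is committed.
print(commit_if_replicated([7, 6, 3, 2], leader_last_index=7, cluster_size=5))  # -> 6
```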

Safety: Raft guarantees that once a log entry is committed, it will be present in the log of every future leader. This is enforced by the election restriction: a voter grants its vote only if the candidate's log is at least as up-to-date as its own, so any winning candidate's log is at least as up-to-date as those of a majority of nodes. One log is at least as up-to-date as another if its last entry has a higher term, or the same term and an index at least as large.
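The "up-to-date" comparison is small enough to state directly in code, as a sketch of the voter-side check:

```python
def log_is_up_to_date(cand_last_term, cand_last_index, my_last_term, my_last_index):
    """Voter-side election restriction: grant the vote only if the candidate's
    log is at least as up-to-date as the voter's own log.
    A later last term always wins; equal last terms compare log length."""
    if cand_last_term != my_last_term:
        return cand_last_term > my_last_term
    return cand_last_index >= my_last_index

print(log_is_up_to_date(3, 5, 2, 9))  # True: a higher last term beats a longer log
print(log_is_up_to_date(3, 5, 3, 6))  # False: same term, shorter log
```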

Leader Election Timeline

A typical Raft election proceeds as follows: the leader crashes and its heartbeats stop; a follower's election timeout fires; that follower becomes a candidate for term $T+1$, sends RequestVote RPCs, and receives a majority of votes; it then begins sending heartbeats as the new leader. The entire process typically completes in well under a second.

Complexity Analysis

Let $n$ be the number of nodes in the cluster.

| Operation | Messages | Rounds | Notes |
| --- | --- | --- | --- |
| Paxos Prepare | $O(n)$ | 1 round-trip | Proposer to quorum |
| Paxos Accept | $O(n)$ | 1 round-trip | Proposer to quorum |
| Paxos total (single value) | $O(n)$ | 2 round-trips | Can pipeline with Multi-Paxos |
| Raft election | $O(n)$ | 1 round-trip | RequestVote to all peers |
| Raft log replication | $O(n)$ | 1 round-trip | AppendEntries to all followers |
| Raft heartbeat | $O(n)$ per interval | Continuous | Empty AppendEntries |

Both Paxos and Raft require a quorum of $\lfloor n/2 \rfloor + 1$ nodes to make progress. The system tolerates up to $f$ failures where:

$f = \lfloor (n - 1) / 2 \rfloor$

For a 5-node cluster: $f = 2$ (tolerates 2 failures). For a 3-node cluster: $f = 1$.
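Both formulas are one-liners, shown here for the common cluster sizes:

```python
def quorum(n):
    """Majority needed to make progress: floor(n/2) + 1."""
    return n // 2 + 1

def max_failures(n):
    """Failures tolerated while a quorum survives: floor((n-1)/2)."""
    return (n - 1) // 2

for n in (3, 4, 5, 7):
    print(f"n={n}: quorum={quorum(n)}, tolerates f={max_failures(n)} failures")
```

Note that `max_failures(4) == max_failures(3)`: an even-sized cluster pays for an extra node without gaining fault tolerance, which is why consensus clusters are usually deployed with odd sizes.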

The expected number of election rounds before a leader is elected in Raft, given randomized timeouts in range $[T, 2T]$:

$E[\text{rounds}] \approx 1 + \frac{1}{n} \text{ (with sufficient timeout spread)}$

In practice, split votes are rare with well-chosen timeout ranges, and elections complete in a single round the vast majority of the time.
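One way to see why split votes are rare is a quick Monte Carlo estimate. This is a deliberately simplified model that calls an election "split" when the runner-up's timeout fires within one network round trip of the first candidate's, before the first candidate's RequestVotes can land; real elections depend on more than timeout spacing:

```python
import random

def split_vote_prob(n, timeout_range, rtt_ms, trials=50_000):
    """Estimate how often a second candidate starts before the first
    candidate's RequestVote messages can reach it."""
    lo, hi = timeout_range
    splits = 0
    for _ in range(trials):
        timeouts = sorted(random.uniform(lo, hi) for _ in range(n))
        if timeouts[1] - timeouts[0] < rtt_ms:
            splits += 1
    return splits / trials

# 5 nodes, 150-300 ms timeouts, 10 ms round trip: contention is the exception.
print(f"{split_vote_prob(5, (150, 300), 10):.1%}")
```

Widening the timeout range or shrinking the round-trip time drives the estimate toward zero, which matches the intuition behind Raft's randomized timeouts.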

Implementation

ALGORITHM PaxosPrepare(proposer, acceptors, proposalNum)
INPUT: proposer: this node, acceptors: list of acceptor nodes, proposalNum: unique proposal number
OUTPUT: (promises, highestAccepted) or failure
BEGIN
  promises <- 0
  highestAccepted <- NIL
  highestAcceptedNum <- -1
  quorum <- FLOOR(LENGTH(acceptors) / 2) + 1

  FOR EACH acceptor IN acceptors DO
    response <- SEND_PREPARE(acceptor, proposalNum)
    IF response.promised THEN
      promises <- promises + 1
      IF response.acceptedNum > highestAcceptedNum THEN
        highestAcceptedNum <- response.acceptedNum
        highestAccepted <- response.acceptedValue
      END IF
    END IF
  END FOR

  IF promises >= quorum THEN
    RETURN (promises, highestAccepted)
  ELSE
    RETURN failure
  END IF
END

ALGORITHM PaxosAccept(proposer, acceptors, proposalNum, value)
INPUT: proposer: this node, acceptors: list of acceptors, proposalNum: proposal number, value: proposed value
OUTPUT: success or failure
BEGIN
  accepts <- 0
  quorum <- FLOOR(LENGTH(acceptors) / 2) + 1

  FOR EACH acceptor IN acceptors DO
    response <- SEND_ACCEPT(acceptor, proposalNum, value)
    IF response.accepted THEN
      accepts <- accepts + 1
    END IF
  END FOR

  IF accepts >= quorum THEN
    NOTIFY_LEARNERS(value)
    RETURN success
  ELSE
    RETURN failure
  END IF
END

ALGORITHM AcceptorHandlePrepare(acceptor, proposalNum)
INPUT: acceptor: this node's state, proposalNum: received proposal number
OUTPUT: promise response
BEGIN
  IF proposalNum > acceptor.highestPromised THEN
    acceptor.highestPromised <- proposalNum
    RETURN {
      promised: TRUE,
      acceptedNum: acceptor.acceptedProposalNum,
      acceptedValue: acceptor.acceptedValue
    }
  ELSE
    RETURN { promised: FALSE }
  END IF
END

ALGORITHM RaftLeaderElection(node, peers, currentTerm)
INPUT: node: this node, peers: list of peer nodes, currentTerm: current term number
OUTPUT: new role (leader or follower)
BEGIN
  node.role <- CANDIDATE
  node.currentTerm <- currentTerm + 1
  node.votedFor <- node.id
  votes <- 1  // vote for self
  quorum <- FLOOR((LENGTH(peers) + 1) / 2) + 1

  FOR EACH peer IN peers DO
    response <- SEND_REQUEST_VOTE(peer, node.currentTerm, node.lastLogIndex, node.lastLogTerm)
    IF response.voteGranted THEN
      votes <- votes + 1
    END IF
    IF response.term > node.currentTerm THEN
      node.currentTerm <- response.term
      node.role <- FOLLOWER
      RETURN FOLLOWER
    END IF
  END FOR

  IF votes >= quorum THEN
    node.role <- LEADER
    SEND_HEARTBEATS(peers, node.currentTerm)
    RETURN LEADER
  ELSE
    node.role <- FOLLOWER
    RETURN FOLLOWER
  END IF
END

ALGORITHM RaftAppendEntries(leader, followers, entry)
INPUT: leader: leader node, followers: list of follower nodes, entry: log entry to replicate
OUTPUT: committed or not
BEGIN
  APPEND entry TO leader.log
  ackCount <- 1  // leader counts itself
  quorum <- FLOOR((LENGTH(followers) + 1) / 2) + 1

  FOR EACH follower IN followers DO
    response <- SEND_APPEND_ENTRIES(follower, leader.currentTerm, entry, leader.commitIndex)
    IF response.success THEN
      ackCount <- ackCount + 1
    ELSE IF response.term > leader.currentTerm THEN
      leader.role <- FOLLOWER
      leader.currentTerm <- response.term
      RETURN not committed
    END IF
  END FOR

  IF ackCount >= quorum THEN
    leader.commitIndex <- leader.commitIndex + 1
    RETURN committed
  ELSE
    RETURN not committed
  END IF
END
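For readers who prefer runnable code, the leader-election pseudocode above translates fairly directly into Python. In this sketch the RPC layer is stubbed out as plain callables and all names are illustrative:

```python
def run_election(node, peers):
    """Mirror of RaftLeaderElection above. Each peer is a callable standing in
    for a RequestVote RPC; it receives (term, last_log_index, last_log_term)
    and returns (peer_term, vote_granted)."""
    node["role"] = "candidate"
    node["term"] += 1
    node["voted_for"] = node["id"]
    votes = 1  # vote for self
    quorum = (len(peers) + 1) // 2 + 1  # peers plus this node

    for request_vote in peers:
        term, granted = request_vote(
            node["term"], node["last_log_index"], node["last_log_term"])
        if term > node["term"]:
            # Someone has seen a newer term: step down immediately.
            node["term"] = term
            node["role"] = "follower"
            return "follower"
        if granted:
            votes += 1

    node["role"] = "leader" if votes >= quorum else "follower"
    return node["role"]

# Stubbed 5-node cluster: two peers grant the vote, two refuse
# (for example because they already voted in this term).
node = {"id": "A", "role": "follower", "term": 4, "voted_for": None,
        "last_log_index": 10, "last_log_term": 4}
grant = lambda term, idx, lterm: (term, True)
refuse = lambda term, idx, lterm: (term, False)
print(run_election(node, [grant, grant, refuse, refuse]))  # -> leader (3 of 5 votes)
```

With its own vote plus two grants, the candidate reaches the quorum of three and wins; replacing one stub with a peer that reports a higher term makes it step down instead.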

Real-World Applications

  • etcd and Kubernetes: etcd uses Raft to maintain a consistent key-value store that Kubernetes relies on for cluster state, service discovery, and configuration management
  • Apache ZooKeeper: uses a Paxos-derived protocol called ZAB (ZooKeeper Atomic Broadcast) to coordinate distributed applications, manage leader election, and maintain configuration
  • CockroachDB and TiKV: these distributed databases use Raft to replicate data across nodes, ensuring that committed transactions survive node failures
  • Google Chubby: a distributed lock service built on Paxos that provides coarse-grained locking and reliable storage for small files, used internally at Google for leader election and configuration
  • Consul: HashiCorp's service mesh and configuration tool uses Raft for consistent storage of service catalog data and health check results
  • Blockchain: while not using Paxos or Raft directly, blockchain systems solve a variant of consensus (Byzantine fault tolerance) where nodes may be actively malicious, not just faulty

Key Takeaways

  • Consensus allows distributed nodes to agree on a single value even when some nodes fail; it is the foundation of reliable distributed systems
  • Paxos uses a two-phase protocol (Prepare and Accept) with proposal numbers to ensure safety; it is provably correct but complex to implement
  • Raft decomposes consensus into leader election, log replication, and safety, making it easier to understand and implement than Paxos
  • Both algorithms require a quorum of $\lfloor n/2 \rfloor + 1$ nodes and tolerate up to $\lfloor (n-1)/2 \rfloor$ failures
  • Raft uses randomized election timeouts to avoid split votes, and heartbeats to maintain leader authority
  • The FLP impossibility result means no consensus algorithm can guarantee liveness in a purely asynchronous system; practical systems use timeouts to work around this