Replication: Keeping Copies of Data Across Nodes
How distributed systems replicate data using leader-follower, multi-leader, and leaderless architectures, and how quorum reads and writes ensure correctness.
Terminology
- Replication: the process of maintaining copies of the same data on multiple nodes so that the system can tolerate node failures, reduce latency, and increase read throughput
- Replica: a node that holds a copy of the data; replicas can serve reads and, depending on the architecture, accept writes
- Leader (primary, master): the single node that accepts all write operations in a leader-follower setup; it propagates changes to followers
- Follower (secondary, replica): a node that receives replicated data from the leader and can serve read requests but does not accept writes directly
- Synchronous replication: the leader waits for at least one follower to confirm it has written the data before acknowledging the write to the client
- Asynchronous replication: the leader acknowledges the write to the client immediately after writing locally, without waiting for followers to confirm
- Replication lag: the delay between a write being applied on the leader and that write appearing on a follower; measured in time or number of operations
- Failover: the process of promoting a follower to leader when the current leader fails
- Split-brain: a dangerous condition where two nodes both believe they are the leader, potentially accepting conflicting writes
- Write-ahead log (WAL): a sequential log of all write operations that the leader sends to followers for replay; ensures durability and ordering
- Quorum: the minimum number of nodes that must participate in a read or write for the operation to be valid; typically a majority
- Read-your-writes consistency: a guarantee that after a client writes a value, subsequent reads by that same client will always see the written value or something newer
- Monotonic reads: a guarantee that if a client reads a value at time $t$, it will never see an older value in a subsequent read
- Conflict resolution: the strategy used to reconcile divergent writes when multiple nodes accept writes independently (e.g., last-writer-wins, merge functions)
- Hinted handoff: a technique where a write intended for an unavailable node is temporarily stored on another node and forwarded when the target recovers
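Hinted handoff can be illustrated with a small in-memory sketch. The `Node` class, the `available` flag, and the `write`/`deliver_hints` helpers below are all illustrative, not the API of any particular system:

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.available = True
        self.data = {}       # key -> value
        self.hints = []      # (target_name, key, value) held for other nodes

def write(nodes, target, key, value, fallback):
    """Write to `target`; if it is down, store a hint on `fallback`."""
    if nodes[target].available:
        nodes[target].data[key] = value
    else:
        nodes[fallback].hints.append((target, key, value))

def deliver_hints(nodes, holder):
    """When a node recovers, forward any hints held for it."""
    remaining = []
    for target, key, value in nodes[holder].hints:
        if nodes[target].available:
            nodes[target].data[key] = value
        else:
            remaining.append((target, key, value))
    nodes[holder].hints = remaining

nodes = {n: Node(n) for n in ("a", "b")}
nodes["a"].available = False
write(nodes, "a", "k", "v1", fallback="b")   # a is down -> hint stored on b
nodes["a"].available = True
deliver_hints(nodes, "b")                    # hint forwarded on recovery
```

After recovery and hint delivery, node `a` holds the value it missed while offline.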
What & Why
A single database server is a single point of failure. If it goes down, your application goes down. Replication solves this by keeping copies of data on multiple machines. If one node fails, others can continue serving requests.
Beyond fault tolerance, replication provides two other benefits. First, it reduces read latency by placing replicas closer to users geographically. A user in Tokyo reads from a Tokyo replica instead of crossing the Pacific to a server in Virginia. Second, it increases read throughput by spreading read queries across multiple replicas instead of funneling everything through one machine.
The fundamental challenge of replication is keeping all copies consistent. When data changes on one node, how and when do the other nodes learn about it? The answer depends on the replication architecture. Leader-follower replication funnels all writes through a single node for simplicity. Multi-leader replication allows writes at multiple nodes for better write availability. Leaderless replication lets any node accept writes and uses quorums to ensure consistency. Each approach makes different trade-offs between consistency, availability, latency, and complexity.
How It Works
Leader-Follower Replication
The most common replication architecture. One node is designated the leader. All writes go to the leader, which writes to its local storage and then sends the change to all followers via a replication log (often a write-ahead log). Followers apply changes in the same order the leader produced them.
Reads can go to the leader (for strong consistency) or to any follower (for higher throughput, at the cost of potentially reading stale data due to replication lag).
Synchronous vs. asynchronous: In synchronous replication, the leader waits for at least one follower to confirm before acknowledging the write. This guarantees that at least two nodes have the data, but increases write latency. In asynchronous replication, the leader acknowledges immediately. This is faster but risks data loss if the leader crashes before the follower catches up.
Most production systems use semi-synchronous replication: one follower is synchronous (guaranteeing at least one up-to-date replica), and the rest are asynchronous.
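The semi-synchronous write path can be sketched in a few lines. In-memory dicts stand in for nodes, and the async replication queue would be drained by a background process; all names here are illustrative:

```python
# Semi-synchronous replication sketch: acknowledge once the leader and one
# designated synchronous follower have the write; queue the rest.
def leader_write(leader, sync_follower, async_followers, key, value):
    leader[key] = value                     # 1. apply locally
    sync_follower[key] = value              # 2. wait for the sync follower's ack
    pending = [(f, key, value) for f in async_followers]  # 3. queue the rest
    return "acknowledged", pending

leader, f1, f2, f3 = {}, {}, {}, {}
status, pending = leader_write(leader, f1, [f2, f3], "x", 1)
# At this point the data is durable on two nodes; f2 and f3 catch up later.
```

The key property: a crash of the leader after acknowledgment cannot lose the write, because the synchronous follower already has it.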
Multi-Leader Replication
Multiple nodes accept writes. Each leader processes writes locally and asynchronously replicates to other leaders. This is useful for multi-datacenter deployments: each datacenter has its own leader, so writes are fast (local) and the system tolerates an entire datacenter going offline.
The major downside is write conflicts. If two leaders concurrently modify the same record, the system must resolve the conflict. Common strategies include last-writer-wins (using timestamps, simple but can lose data), merge functions (application-specific logic to combine conflicting values), and CRDTs (data structures designed to merge automatically without conflicts).
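Two of these strategies can be sketched with (value, timestamp) pairs. The set-union `merge_sets` function is an illustrative application-specific merge, not a standard API:

```python
def last_writer_wins(a, b):
    """Keep the write with the larger timestamp; the other is silently lost."""
    return a if a[1] >= b[1] else b

def merge_sets(a, b):
    """Application-specific merge: union the two set-valued writes."""
    return (a[0] | b[0], max(a[1], b[1]))

# Concurrent writes accepted by two different leaders:
w1 = ({"alice"}, 100)
w2 = ({"bob"}, 101)
last_writer_wins(w1, w2)   # ({"bob"}, 101) -- alice's write is lost
merge_sets(w1, w2)         # ({"alice", "bob"}, 101) -- both survive
```

This makes the trade-off concrete: last-writer-wins is trivial to implement but discards data; a merge function preserves both writes at the cost of application-specific logic.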
Leaderless Replication
No node is special. Any replica can accept reads and writes. The client sends writes to multiple replicas in parallel and considers the write successful when a write quorum W of N replicas acknowledges it. For reads, the client queries a read quorum R of replicas and takes the most recent value.
Strong consistency is achieved when W + R > N, because the read and write quorums must overlap in at least one node that has the latest value.
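The overlap condition can be verified exhaustively for small configurations. This brute-force check is a demonstration of why $W + R > N$ works, not how real systems validate quorums:

```python
from itertools import combinations

def quorums_overlap(n, w, r):
    """True if every W-subset of N replicas intersects every R-subset."""
    replicas = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(replicas, w)
               for rs in combinations(replicas, r))

quorums_overlap(3, 2, 2)   # True:  2 + 2 > 3, every read sees the last write
quorums_overlap(3, 1, 1)   # False: a read can land on a replica the write missed
```

When the condition fails, there exists a write set and a read set with no replica in common, so a read can return stale data even when all nodes are healthy.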
Anti-entropy and read repair: Since replicas can diverge, leaderless systems use background processes (anti-entropy) to compare replicas and fix inconsistencies. Read repair is a lighter approach: when a client reads from multiple replicas and detects a stale value, it writes the newer value back to the stale replica.
Complexity Analysis
Let $N$ be the total number of replicas, $W$ the write quorum, and $R$ the read quorum.
| Architecture | Write Latency | Read Latency | Fault Tolerance |
|---|---|---|---|
| Leader-Follower (sync) | $O(\text{RTT}_{\text{max follower}})$ | $O(1)$ local read | Survives follower failures |
| Leader-Follower (async) | $O(1)$ local write | $O(1)$ but may be stale | Data loss if leader fails |
| Multi-Leader | $O(1)$ local write | $O(1)$ local read | Survives datacenter failure |
| Leaderless (quorum) | $O(\text{RTT})$, waiting for the $W$-th fastest ack | $O(\text{RTT})$, waiting for the $R$-th fastest ack | Survives $N - W$ write failures, $N - R$ read failures |
For quorum-based systems, the consistency guarantee depends on the overlap between read and write quorums:

$$W + R > N$$

The number of node failures the system can tolerate:

$$f_{\text{write}} = N - W, \qquad f_{\text{read}} = N - R$$

For $N = 3, W = 2, R = 2$: the system tolerates 1 node failure for both reads and writes while maintaining strong consistency.
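These tolerances can be computed directly. A small sketch (the dict keys are illustrative labels, not a standard API):

```python
# Writes need W of N replicas, so up to N - W may be down;
# reads need R of N, so up to N - R may be down.
def tolerance(n, w, r):
    return {"writes": n - w, "reads": n - r, "strong": w + r > n}

tolerance(3, 2, 2)   # {'writes': 1, 'reads': 1, 'strong': True}
tolerance(5, 3, 3)   # {'writes': 2, 'reads': 2, 'strong': True}
```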
Replication lag in an asynchronous leader-follower system at any instant is the gap between the leader's latest log position and the follower's last applied position:

$$\text{lag} = \text{LSN}_{\text{leader}} - \text{LSN}_{\text{follower}}$$

Under normal conditions this corresponds to milliseconds, but during load spikes or network congestion, lag can grow to seconds or even minutes; in a purely asynchronous system there is no hard upper bound.
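Lag is commonly monitored by comparing log positions. A minimal sketch, assuming LSN-based tracking (the names and the threshold are illustrative; real systems also report lag in bytes or wall-clock time):

```python
def replication_lag(leader_lsn, follower_lsn):
    """Lag in log entries: how far behind the follower is."""
    return leader_lsn - follower_lsn

def lagging_followers(leader_lsn, follower_lsns, max_lag):
    """Followers too far behind to safely serve reads."""
    return [name for name, lsn in follower_lsns.items()
            if replication_lag(leader_lsn, lsn) > max_lag]

followers = {"f1": 1000, "f2": 997, "f3": 400}
lagging_followers(1000, followers, max_lag=100)   # ['f3']
```

A load balancer can use such a check to route reads away from stale replicas.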
Implementation
ALGORITHM LeaderFollowerWrite(leader, followers, key, value)
INPUT: leader: leader node, followers: list of follower nodes, key: data key, value: data value
OUTPUT: success or failure
BEGIN
// Write to leader's local storage
lsn <- leader.nextLSN()
leader.wal.append(lsn, key, value)
leader.storage.put(key, value)
// Replicate to followers
syncFollower <- followers[0] // first follower is synchronous
response <- SEND_REPLICATE(syncFollower, lsn, key, value)
IF response is failure THEN
RETURN failure("synchronous follower did not acknowledge")
END IF
// Async replication to remaining followers
FOR i <- 1 TO LENGTH(followers) - 1 DO
ASYNC_SEND_REPLICATE(followers[i], lsn, key, value)
END FOR
RETURN success
END
ALGORITHM FollowerApplyReplication(follower, leader, lsn, key, value)
INPUT: follower: this node, leader: the leader node, lsn: log sequence number, key: data key, value: data value
OUTPUT: acknowledgment
BEGIN
IF lsn <= follower.lastAppliedLSN THEN
RETURN ack(duplicate)
END IF
IF lsn > follower.lastAppliedLSN + 1 THEN
// Gap detected: fetch and apply the missing entries in order before this one
REQUEST_MISSING_ENTRIES(leader, follower.lastAppliedLSN + 1, lsn - 1)
END IF
follower.wal.append(lsn, key, value)
follower.storage.put(key, value)
follower.lastAppliedLSN <- lsn
RETURN ack(success)
END
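The follower-side apply logic above can be sketched as runnable Python. Here REQUEST_MISSING_ENTRIES is replaced by a local buffer that holds out-of-order entries until the gap fills, which is an illustrative simplification:

```python
class Follower:
    def __init__(self):
        self.storage = {}
        self.last_applied = 0
        self.buffer = {}   # lsn -> (key, value), held until gaps fill

    def apply(self, lsn, key, value):
        if lsn <= self.last_applied:
            return "duplicate"          # already applied: safe to ignore
        self.buffer[lsn] = (key, value)
        # Apply every contiguous entry now available, in LSN order
        while self.last_applied + 1 in self.buffer:
            next_lsn = self.last_applied + 1
            k, v = self.buffer.pop(next_lsn)
            self.storage[k] = v
            self.last_applied = next_lsn
        return "ok"

f = Follower()
f.apply(1, "a", 1)
f.apply(3, "c", 3)      # out of order: buffered, not applied yet
f.apply(2, "b", 2)      # fills the gap, then entry 3 is applied too
# f.last_applied == 3, f.storage == {'a': 1, 'b': 2, 'c': 3}
```

Duplicate detection via LSN comparison makes replay idempotent, which matters because network retries can deliver the same entry twice.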
ALGORITHM LeaderlessQuorumWrite(client, replicas, key, value, W)
INPUT: client: requesting client, replicas: list of N replica nodes, key: data key, value: data value, W: write quorum
OUTPUT: success or failure
BEGIN
timestamp <- CURRENT_TIME()
ackCount <- 0
// In practice these writes are sent in parallel; shown sequentially for clarity
FOR EACH replica IN replicas DO
response <- SEND_WRITE(replica, key, value, timestamp)
IF response is successful THEN
ackCount <- ackCount + 1
END IF
END FOR
IF ackCount >= W THEN
RETURN success
ELSE
RETURN failure("write quorum not reached")
END IF
END
ALGORITHM LeaderlessQuorumRead(client, replicas, key, R)
INPUT: client: requesting client, replicas: list of N replica nodes, key: data key, R: read quorum
OUTPUT: value or failure
BEGIN
responses <- empty list
FOR EACH replica IN replicas DO
response <- SEND_READ(replica, key)
IF response is successful THEN
APPEND response TO responses
END IF
END FOR
IF LENGTH(responses) < R THEN
RETURN failure("read quorum not reached")
END IF
// Find the most recent value by timestamp
latest <- response IN responses WITH MAX(timestamp)
// Read repair: update stale replicas
FOR EACH response IN responses DO
IF response.timestamp < latest.timestamp THEN
ASYNC_SEND_WRITE(response.node, key, latest.value, latest.timestamp)
END IF
END FOR
RETURN latest.value
END
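The quorum write/read pseudocode above can be sketched as runnable Python over in-memory replica dicts, where each replica maps key to a (value, timestamp) pair, network calls become plain function calls, and an unavailable replica is `None`:

```python
def quorum_write(replicas, key, value, ts, w):
    acks = 0
    for rep in replicas:
        if rep is not None:               # replica reachable
            rep[key] = (value, ts)
            acks += 1
    return acks >= w                      # success iff write quorum reached

def quorum_read(replicas, key, r):
    responses = [(rep, rep[key]) for rep in replicas
                 if rep is not None and key in rep]
    if len(responses) < r:
        return None                       # read quorum not reached
    latest = max(responses, key=lambda p: p[1][1])
    for rep, (val, ts) in responses:      # read repair: fix stale responders
        if ts < latest[1][1]:
            rep[key] = latest[1]
    return latest[1][0]

r1, r2, r3 = {}, {}, {}
quorum_write([r1, r2, None], "k", "v1", ts=1, w=2)   # r3 is down; still succeeds
quorum_read([r1, r2, r3], "k", r=2)                  # 'v1'
```

Note that read repair here only fixes replicas that responded with a stale timestamp; a replica that missed the write entirely (like `r3`) is caught up later by anti-entropy.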
ALGORITHM Failover(followers, deadLeader)
INPUT: followers: list of follower nodes, deadLeader: the failed leader node
OUTPUT: new leader node
BEGIN
// Select the follower with the highest LSN (most up-to-date)
bestFollower <- NIL
highestLSN <- -1
FOR EACH follower IN followers DO
IF follower.lastAppliedLSN > highestLSN THEN
highestLSN <- follower.lastAppliedLSN
bestFollower <- follower
END IF
END FOR
bestFollower.role <- LEADER
// NOTE: production systems must also fence the old leader (e.g., revoke its
// lease) before promotion, or split-brain is possible if it comes back
// Notify all other followers of the new leader
FOR EACH follower IN followers DO
IF follower != bestFollower THEN
SEND_NEW_LEADER_NOTIFICATION(follower, bestFollower)
END IF
END FOR
RETURN bestFollower
END
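The leader-selection step of the failover routine above reduces to picking the follower with the highest applied LSN. A one-function sketch (fencing of the old leader is deliberately out of scope here, as in the pseudocode):

```python
def choose_new_leader(followers):
    """followers: dict of name -> last applied LSN. Returns the name to promote."""
    return max(followers, key=followers.get)

choose_new_leader({"f1": 120, "f2": 135, "f3": 130})   # 'f2'
```

Promoting the most up-to-date follower minimizes the number of acknowledged writes lost during an asynchronous failover, but cannot eliminate loss entirely.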
Real-World Applications
- PostgreSQL streaming replication: uses leader-follower with a write-ahead log shipped to standby replicas; supports synchronous and asynchronous modes for different durability and performance trade-offs
- MySQL Group Replication: provides multi-leader replication within a single cluster using a Paxos-based protocol for conflict detection and ordering
- Amazon Dynamo: the influential leaderless design (which inspired DynamoDB and Cassandra) with configurable $N$, $R$, and $W$ per request, allowing applications to tune the consistency-availability trade-off; DynamoDB itself exposes a simpler choice between eventually consistent and strongly consistent reads
- Apache Cassandra: leaderless architecture where any node can accept writes; uses tunable consistency levels (ONE, QUORUM, ALL) and anti-entropy repair for convergence
- CouchDB: multi-leader replication designed for offline-first applications; each device has a full replica that syncs when connectivity is available, with automatic conflict detection
- MongoDB replica sets: leader-follower replication with automatic failover; the replica set elects a new primary using a Raft-like protocol when the current primary becomes unreachable
Key Takeaways
- Replication keeps copies of data on multiple nodes for fault tolerance, lower read latency, and higher read throughput
- Leader-follower is the simplest model: one node accepts writes and propagates them to followers; synchronous replication trades latency for durability, asynchronous trades durability for speed
- Multi-leader replication allows writes at multiple nodes (useful for multi-datacenter setups) but introduces write conflicts that require resolution strategies
- Leaderless replication sends reads and writes to multiple nodes in parallel; strong consistency requires $W + R > N$ so that read and write quorums overlap
- Failover in leader-follower systems risks split-brain if the old leader comes back online; fencing mechanisms prevent two leaders from operating simultaneously
- Replication lag is unavoidable in asynchronous systems; read-your-writes and monotonic reads are consistency guarantees that help applications cope with lag