
Replication: Keeping Copies of Data Across Nodes

How distributed systems replicate data using leader-follower, multi-leader, and leaderless architectures, and how quorum reads and writes ensure correctness.

2021-02-21
Distributed Systems, Replication

Terminology

  • Replication: the process of maintaining copies of the same data on multiple nodes so that the system can tolerate node failures, reduce latency, and increase read throughput
  • Replica: a node that holds a copy of the data; replicas can serve reads and, depending on the architecture, accept writes
  • Leader (primary, master): the single node that accepts all write operations in a leader-follower setup; it propagates changes to followers
  • Follower (secondary, replica): a node that receives replicated data from the leader and can serve read requests but does not accept writes directly
  • Synchronous replication: the leader waits for at least one follower to confirm it has written the data before acknowledging the write to the client
  • Asynchronous replication: the leader acknowledges the write to the client immediately after writing locally, without waiting for followers to confirm
  • Replication lag: the delay between a write being applied on the leader and that write appearing on a follower; measured in time or number of operations
  • Failover: the process of promoting a follower to leader when the current leader fails
  • Split-brain: a dangerous condition where two nodes both believe they are the leader, potentially accepting conflicting writes
  • Write-ahead log (WAL): a sequential log of all write operations that the leader sends to followers for replay; ensures durability and ordering
  • Quorum: the minimum number of nodes that must participate in a read or write for the operation to be valid; typically a majority
  • Read-your-writes consistency: a guarantee that after a client writes a value, subsequent reads by that same client will always see the written value or something newer
  • Monotonic reads: a guarantee that if a client reads a value at time $t$, it will never see an older value in a subsequent read
  • Conflict resolution: the strategy used to reconcile divergent writes when multiple nodes accept writes independently (e.g., last-writer-wins, merge functions)
  • Hinted handoff: a technique where a write intended for an unavailable node is temporarily stored on another node and forwarded when the target recovers

What & Why

A single database server is a single point of failure. If it goes down, your application goes down. Replication solves this by keeping copies of data on multiple machines. If one node fails, others can continue serving requests.

Beyond fault tolerance, replication provides two other benefits. First, it reduces read latency by placing replicas closer to users geographically. A user in Tokyo reads from a Tokyo replica instead of crossing the Pacific to a server in Virginia. Second, it increases read throughput by spreading read queries across multiple replicas instead of funneling everything through one machine.

The fundamental challenge of replication is keeping all copies consistent. When data changes on one node, how and when do the other nodes learn about it? The answer depends on the replication architecture. Leader-follower replication funnels all writes through a single node for simplicity. Multi-leader replication allows writes at multiple nodes for better write availability. Leaderless replication lets any node accept writes and uses quorums to ensure consistency. Each approach makes different trade-offs between consistency, availability, latency, and complexity.

How It Works

Leader-Follower Replication

Leader-follower is the most common replication architecture. One node is designated the leader. All writes go to the leader, which writes to its local storage and then sends each change to all followers via a replication log (often a write-ahead log). Followers apply changes in the same order the leader produced them.

Reads can go to the leader (for strong consistency) or to any follower (for higher throughput, at the cost of potentially reading stale data due to replication lag).

Synchronous vs. asynchronous: In synchronous replication, the leader waits for at least one follower to confirm before acknowledging the write. This guarantees that at least two nodes have the data, but increases write latency. In asynchronous replication, the leader acknowledges immediately. This is faster but risks data loss if the leader crashes before the follower catches up.

Most production systems use semi-synchronous replication: one follower is synchronous (guaranteeing at least one up-to-date replica), and the rest are asynchronous.
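The write path above can be sketched in a few lines of Python. This is an illustration only, not a real database API: the `Leader` and `Follower` classes and their methods are invented, the WAL is omitted, and the "asynchronous" sends are shown as plain calls.

```python
# Minimal sketch of a semi-synchronous write path (illustrative only).

class Follower:
    def __init__(self):
        self.storage = {}
        self.last_lsn = 0

    def replicate(self, lsn, key, value):
        self.storage[key] = value
        self.last_lsn = lsn
        return True  # acknowledgment back to the leader

class Leader:
    def __init__(self, followers):
        self.storage = {}
        self.lsn = 0
        self.followers = followers

    def write(self, key, value):
        self.lsn += 1
        self.storage[key] = value                # local write (WAL omitted)
        sync = self.followers[0]                 # first follower is synchronous
        if not sync.replicate(self.lsn, key, value):
            return False                         # sync follower must ack
        for f in self.followers[1:]:
            f.replicate(self.lsn, key, value)    # fire-and-forget in reality
        return True

followers = [Follower(), Follower(), Follower()]
leader = Leader(followers)
leader.write("user:1", "alice")
```

If the synchronous follower fails to acknowledge, the write fails outright; the remaining followers may lag behind without blocking the client.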

Multi-Leader Replication

[Figure: three leaders (US-East, EU-West, AP-Tokyo), each with a local follower, replicating to one another asynchronously. Concurrent writes to the same key fall in a conflict zone and require conflict resolution (LWW, merge, CRDT).]

Multiple nodes accept writes. Each leader processes writes locally and asynchronously replicates to other leaders. This is useful for multi-datacenter deployments: each datacenter has its own leader, so writes are fast (local) and the system tolerates an entire datacenter going offline.

The major downside is write conflicts. If two leaders concurrently modify the same record, the system must resolve the conflict. Common strategies include last-writer-wins (using timestamps, simple but can lose data), merge functions (application-specific logic to combine conflicting values), and CRDTs (data structures designed to merge automatically without conflicts).
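To make the trade-off concrete, here is a small Python sketch of two of those strategies (the timestamped-write representation and the set-union merge are assumptions for illustration, not any particular database's API):

```python
# Two conflict-resolution strategies for concurrent multi-leader writes.

def last_writer_wins(a, b):
    """Keep the write with the higher timestamp; the other write is lost."""
    return a if a["ts"] >= b["ts"] else b

def merge_sets(a, b):
    """Application-specific merge: union the values instead of dropping one."""
    return {"ts": max(a["ts"], b["ts"]), "value": a["value"] | b["value"]}

# Two leaders concurrently update the same shopping cart:
w1 = {"ts": 100, "value": {"book"}}
w2 = {"ts": 101, "value": {"pen"}}

last_writer_wins(w1, w2)   # keeps only {"pen"}; the book is silently lost
merge_sets(w1, w2)         # keeps {"book", "pen"}
```

Last-writer-wins is simple but discards one of the writes; a merge function (or a CRDT, which generalizes this idea) preserves both at the cost of application-specific logic.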

Leaderless Replication

No node is special. Any replica can accept reads and writes. The client sends writes to multiple replicas in parallel and considers the write successful when a write quorum W of N replicas acknowledges it. For reads, the client queries a read quorum R of replicas and takes the most recent value.

Strong consistency is achieved when W + R > N, because the read and write quorums must overlap in at least one node that has the latest value.

Anti-entropy and read repair: Since replicas can diverge, leaderless systems use background processes (anti-entropy) to compare replicas and fix inconsistencies. Read repair is a lighter approach: when a client reads from multiple replicas and detects a stale value, it writes the newer value back to the stale replica.
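A toy simulation of quorum reads with read repair, assuming $N = 3$, $W = R = 2$. Replicas are plain dicts and network sends are direct function calls; a real client library would issue these requests in parallel and handle per-replica failures.

```python
# Toy leaderless cluster: each replica maps key -> (value, timestamp).
N, W, R = 3, 2, 2
replicas = [{} for _ in range(N)]

def quorum_write(key, value, ts):
    acks = 0
    for rep in replicas:            # in reality sent in parallel; may fail
        rep[key] = (value, ts)
        acks += 1
    return acks >= W

def quorum_read(key):
    responses = [(i, *rep[key]) for i, rep in enumerate(replicas) if key in rep]
    if len(responses) < R:
        raise RuntimeError("read quorum not reached")
    _, value, ts = max(responses, key=lambda r: r[2])  # newest timestamp wins
    for i, _, r_ts in responses:    # read repair: refresh stale replicas
        if r_ts < ts:
            replicas[i][key] = (value, ts)
    return value

quorum_write("k", "v1", ts=1)
replicas[2]["k"] = ("v0", 0)        # simulate a stale replica
quorum_read("k")                    # returns "v1" and repairs replica 2
```

After the read, the stale replica holds the latest value again without any background process having run.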

Complexity Analysis

Let $N$ be the total number of replicas, $W$ the write quorum, and $R$ the read quorum.

Architecture Write Latency Read Latency Fault Tolerance
Leader-Follower (sync) $O(\text{RTT}_{\text{max follower}})$ $O(1)$ local read Survives follower failures
Leader-Follower (async) $O(1)$ local write $O(1)$ but may be stale Data loss if leader fails
Multi-Leader $O(1)$ local write $O(1)$ local read Survives datacenter failure
Leaderless (quorum) $O(\text{RTT})$, waiting for $W$ parallel acks $O(\text{RTT})$, waiting for $R$ parallel acks Survives $N - W$ write failures

For quorum-based systems, the consistency guarantee depends on the overlap between read and write quorums:

$W + R > N \implies \text{strong consistency (guaranteed overlap)}$

The number of node failures the system can tolerate:

$\text{Write tolerance} = N - W$
$\text{Read tolerance} = N - R$

For $N = 3, W = 2, R = 2$: the system tolerates 1 node failure for both reads and writes while maintaining strong consistency.
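The formulas above are plain arithmetic and easy to check in code (a hypothetical helper, not part of any system):

```python
# Quick check of the quorum formulas: overlap and fault tolerance.

def quorum_properties(n, w, r):
    return {
        "strongly_consistent": w + r > n,   # read/write quorums must overlap
        "write_tolerance": n - w,           # node failures writes survive
        "read_tolerance": n - r,            # node failures reads survive
    }

quorum_properties(3, 2, 2)
# → {'strongly_consistent': True, 'write_tolerance': 1, 'read_tolerance': 1}
```

Lowering $R$ to 1 speeds up reads but breaks the overlap guarantee: $W + R = 3 \not> 3$.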

Replication lag in an asynchronous leader-follower system is, under healthy conditions, approximately:

$\text{Lag} \approx \text{latency}_{\text{leader} \to \text{follower}} + \text{apply time}$

Under normal conditions this is milliseconds, but during load spikes or network congestion, lag can grow to seconds or even minutes.

Implementation

ALGORITHM LeaderFollowerWrite(leader, followers, key, value)
INPUT: leader: leader node, followers: list of follower nodes, key: data key, value: data value
OUTPUT: success or failure
BEGIN
  // Write to leader's local storage
  lsn <- leader.nextLSN()
  leader.wal.append(lsn, key, value)
  leader.storage.put(key, value)

  // Replicate to followers
  syncFollower <- followers[0]  // first follower is synchronous
  response <- SEND_REPLICATE(syncFollower, lsn, key, value)

  IF response is failure THEN
    RETURN failure("synchronous follower did not acknowledge")
  END IF

  // Async replication to remaining followers
  FOR i <- 1 TO LENGTH(followers) - 1 DO
    ASYNC_SEND_REPLICATE(followers[i], lsn, key, value)
  END FOR

  RETURN success
END

ALGORITHM FollowerApplyReplication(follower, lsn, key, value)
INPUT: follower: this node, lsn: log sequence number, key: data key, value: data value
OUTPUT: acknowledgment
BEGIN
  IF lsn <= follower.lastAppliedLSN THEN
    RETURN ack(duplicate)
  END IF

  IF lsn > follower.lastAppliedLSN + 1 THEN
    // gap detected: fetch and apply the missing entries before this one
    REQUEST_MISSING_ENTRIES(follower.leader, follower.lastAppliedLSN + 1, lsn - 1)
  END IF

  follower.wal.append(lsn, key, value)
  follower.storage.put(key, value)
  follower.lastAppliedLSN <- lsn
  RETURN ack(success)
END
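A runnable version of the follower-side logic above, in Python. The `fetch_missing` callback stands in for `REQUEST_MISSING_ENTRIES` (here it is a stub supplied by the caller; how the leader resends entries is an assumption, not shown):

```python
# Follower-side apply loop: deduplicate by LSN and fill gaps before applying.

class FollowerState:
    def __init__(self):
        self.storage = {}
        self.wal = []
        self.last_applied_lsn = 0

def apply_replication(follower, lsn, key, value, fetch_missing=None):
    if lsn <= follower.last_applied_lsn:
        return "duplicate"                 # already applied; safe to ignore
    if lsn > follower.last_applied_lsn + 1 and fetch_missing:
        # Gap detected: fetch and apply the missing entries first.
        for m_lsn, m_key, m_val in fetch_missing(follower.last_applied_lsn + 1,
                                                 lsn - 1):
            apply_replication(follower, m_lsn, m_key, m_val)
    follower.wal.append((lsn, key, value))
    follower.storage[key] = value
    follower.last_applied_lsn = lsn
    return "success"

f = FollowerState()
apply_replication(f, 1, "a", "1")
apply_replication(f, 1, "a", "1")          # duplicate delivery → ignored
apply_replication(f, 3, "c", "3",          # arrives before LSN 2
                  fetch_missing=lambda lo, hi: [(2, "b", "2")])
```

Because replay is ordered by LSN and duplicates are dropped, the follower converges to the leader's state even when messages are duplicated or arrive out of order.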

ALGORITHM LeaderlessQuorumWrite(client, replicas, key, value, W)
INPUT: client: requesting client, replicas: list of N replica nodes, key: data key, value: data value, W: write quorum
OUTPUT: success or failure
BEGIN
  timestamp <- CURRENT_TIME()
  ackCount <- 0

  // in practice these writes are sent in parallel; shown sequentially for clarity
  FOR EACH replica IN replicas DO
    response <- SEND_WRITE(replica, key, value, timestamp)
    IF response is successful THEN
      ackCount <- ackCount + 1
    END IF
  END FOR

  IF ackCount >= W THEN
    RETURN success
  ELSE
    RETURN failure("write quorum not reached")
  END IF
END

ALGORITHM LeaderlessQuorumRead(client, replicas, key, R)
INPUT: client: requesting client, replicas: list of N replica nodes, key: data key, R: read quorum
OUTPUT: value or failure
BEGIN
  responses <- empty list

  // in practice these reads are sent in parallel; shown sequentially for clarity
  FOR EACH replica IN replicas DO
    response <- SEND_READ(replica, key)
    IF response is successful THEN
      APPEND response TO responses
    END IF
  END FOR

  IF LENGTH(responses) < R THEN
    RETURN failure("read quorum not reached")
  END IF

  // Find the most recent value by timestamp
  latest <- response IN responses WITH MAX(timestamp)

  // Read repair: update stale replicas
  FOR EACH response IN responses DO
    IF response.timestamp < latest.timestamp THEN
      ASYNC_SEND_WRITE(response.node, key, latest.value, latest.timestamp)
    END IF
  END FOR

  RETURN latest.value
END

ALGORITHM Failover(followers, deadLeader)
INPUT: followers: list of follower nodes, deadLeader: the failed leader node
OUTPUT: new leader node
BEGIN
  // Select the follower with the highest LSN (most up-to-date)
  bestFollower <- NIL
  highestLSN <- -1

  FOR EACH follower IN followers DO
    IF follower.lastAppliedLSN > highestLSN THEN
      highestLSN <- follower.lastAppliedLSN
      bestFollower <- follower
    END IF
  END FOR

  bestFollower.role <- LEADER

  // Notify all other followers of the new leader
  FOR EACH follower IN followers DO
    IF follower != bestFollower THEN
      SEND_NEW_LEADER_NOTIFICATION(follower, bestFollower)
    END IF
  END FOR

  RETURN bestFollower
END
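The failover algorithm above does not guard against split-brain: if the dead leader comes back, it may keep accepting writes. A common defense is a fencing token, sketched below in Python (illustrative; a real system would obtain the epoch number from a consensus or coordination service rather than tracking it in storage directly):

```python
# Fencing tokens: shared storage rejects writes from a deposed leader.

class Store:
    def __init__(self):
        self.highest_epoch = 0
        self.data = {}

    def write(self, epoch, key, value):
        if epoch < self.highest_epoch:
            return False                 # stale leader: fence off the write
        self.highest_epoch = epoch
        self.data[key] = value
        return True

store = Store()
store.write(epoch=1, key="x", value="old-leader")   # original leader, epoch 1
store.write(epoch=2, key="x", value="new-leader")   # promoted leader, epoch 2
store.write(epoch=1, key="x", value="zombie")       # old leader returns → rejected
```

Each failover increments the epoch, so any node still holding the old epoch is rejected the moment it tries to write.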

Real-World Applications

  • PostgreSQL streaming replication: uses leader-follower with a write-ahead log shipped to standby replicas; supports synchronous and asynchronous modes for different durability and performance trade-offs
  • MySQL Group Replication: provides multi-leader replication within a single cluster using a Paxos-based protocol for conflict detection and ordering
  • Amazon Dynamo / DynamoDB: the original Dynamo design uses leaderless replication with configurable read and write quorums; DynamoDB exposes a simpler per-request choice between eventually consistent and strongly consistent reads
  • Apache Cassandra: leaderless architecture where any node can accept writes; uses tunable consistency levels (ONE, QUORUM, ALL) and anti-entropy repair for convergence
  • CouchDB: multi-leader replication designed for offline-first applications; each device has a full replica that syncs when connectivity is available, with automatic conflict detection
  • MongoDB replica sets: leader-follower replication with automatic failover; the replica set elects a new primary using a Raft-like protocol when the current primary becomes unreachable

Key Takeaways

  • Replication keeps copies of data on multiple nodes for fault tolerance, lower read latency, and higher read throughput
  • Leader-follower is the simplest model: one node accepts writes and propagates them to followers; synchronous replication trades latency for durability, asynchronous trades durability for speed
  • Multi-leader replication allows writes at multiple nodes (useful for multi-datacenter setups) but introduces write conflicts that require resolution strategies
  • Leaderless replication sends reads and writes to multiple nodes in parallel; strong consistency requires $W + R > N$ so that read and write quorums overlap
  • Failover in leader-follower systems risks split-brain if the old leader comes back online; fencing mechanisms prevent two leaders from operating simultaneously
  • Replication lag is unavoidable in asynchronous systems; read-your-writes and monotonic reads are consistency guarantees that help applications cope with lag