CAP Theorem: The Fundamental Trade-Off in Distributed Systems
Understanding the CAP theorem and why every distributed system must choose between consistency, availability, and partition tolerance when network failures occur.
Terminology
- Distributed system: a collection of independent computers (nodes) that communicate over a network and coordinate to appear as a single coherent system to end users
- Consistency (C): every read receives the most recent write or an error; all nodes see the same data at the same time
- Availability (A): every request to a non-failing node receives a response, without guarantee that it contains the most recent write
- Partition tolerance (P): the system continues to operate despite arbitrary message loss or delay between nodes caused by network failures
- Network partition: a break in communication between two or more groups of nodes in a distributed system, where nodes within each group can communicate but nodes across groups cannot
- Linearizability: the strongest form of consistency, where operations appear to execute atomically and in real-time order across all nodes
- Eventual consistency: a consistency model where, given no new updates, all replicas will eventually converge to the same value
- Quorum: the minimum number of nodes that must agree on an operation for it to be considered successful; typically a majority of the total nodes
- Replica: a copy of data stored on a different node, maintained to improve availability and fault tolerance
- Split-brain: a condition where a network partition causes two or more groups of nodes to independently believe they are the authoritative source, potentially leading to conflicting updates
- Failover: the process of automatically switching to a backup node or replica when the primary node fails
- Latency: the time delay between a client sending a request and receiving a response
What & Why
Every distributed system runs on multiple machines connected by a network. Networks are unreliable: cables get cut, switches fail, and packets get dropped. The CAP theorem, proved by Seth Gilbert and Nancy Lynch in 2002 (based on Eric Brewer's 2000 conjecture), states a fundamental truth about this reality: a distributed data store cannot simultaneously provide all three of consistency, availability, and partition tolerance. When a network partition occurs, the system must choose between consistency and availability.
This matters because partition tolerance is not optional. Networks will fail. The real engineering question is: when a partition happens, does your system refuse to respond (preserving consistency) or does it respond with potentially stale data (preserving availability)?
Understanding CAP is the starting point for reasoning about distributed system design. It explains why databases like traditional relational systems with synchronous replication prioritize consistency (CP), why systems like Cassandra and DynamoDB default to availability (AP), and why no system can promise all three guarantees at once during a partition. The theorem does not say you must always sacrifice one property. During normal operation (no partition), a system can be both consistent and available. The trade-off only manifests when the network breaks.
How It Works
The Three Guarantees
Consistency means that after a write completes, every subsequent read (from any node) returns that written value or a newer one. If a client writes "balance = 100" and the write is acknowledged, no other client should ever read an older value like "balance = 50" from any node in the system.
Availability means that every request to a functioning node gets a response. The system never tells a healthy client "I cannot answer right now." The response might contain stale data, but it always responds.
Partition tolerance means the system keeps working even when some messages between nodes are lost or delayed indefinitely. Since real networks always have this risk, every practical distributed system must be partition-tolerant.
The Trade-Off During a Partition
When a network partition splits nodes into two groups, the system faces a choice:
Choose Consistency (CP): Nodes on the minority side of the partition stop accepting writes (or all requests) until the partition heals. This guarantees that no stale reads occur, but some clients get errors or timeouts. Systems like HBase, MongoDB (with majority write concern), and ZooKeeper take this approach.
Choose Availability (AP): Both sides of the partition continue accepting reads and writes independently. Clients always get a response, but the two sides may diverge. When the partition heals, the system must reconcile conflicting updates. Systems like Cassandra, DynamoDB, and CouchDB take this approach.
CA (no partition tolerance): This only works on a single machine or a network that never fails. Since real networks always risk partitions, CA is not a practical choice for distributed systems. A single-node PostgreSQL database is technically CA, but the moment you add replication across a network, you must handle partitions.
CP System Behavior During a Partition
When a partition occurs in a CP system, the nodes that cannot reach a majority (quorum) refuse to serve requests. A client connecting to an isolated minority node receives an error. The majority side continues operating normally. Once the partition heals, the minority nodes sync with the majority and resume serving requests.
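A minimal sketch of this minority-side refusal, assuming a hypothetical `Node` class with a fixed replica count (not any real database's API):

```python
# Illustrative CP node: serves requests only while it can reach a majority.
TOTAL_REPLICAS = 5
QUORUM = TOTAL_REPLICAS // 2 + 1   # majority: 3 of 5

class Node:
    def __init__(self, reachable_replicas: int, local_value: str):
        self.reachable_replicas = reachable_replicas  # including itself
        self.local_value = local_value

    def read(self) -> str:
        # A CP node refuses rather than risk returning stale data.
        if self.reachable_replicas < QUORUM:
            raise RuntimeError("unavailable: node is on the minority side")
        return self.local_value

majority_side = Node(reachable_replicas=3, local_value="balance = 100")
minority_side = Node(reachable_replicas=2, local_value="balance = 50")

print(majority_side.read())   # "balance = 100" -- majority keeps serving
try:
    minority_side.read()
except RuntimeError as e:
    print(e)                  # minority refuses: consistency preserved, availability lost
```

Once the partition heals, a real system would let the minority nodes catch up before lifting the refusal.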
AP System Behavior During a Partition
When a partition occurs in an AP system, all nodes continue serving requests. A write to node A on one side of the partition is not visible to node B on the other side. Both sides may accept conflicting writes to the same key. When the partition heals, the system uses a conflict resolution strategy (last-writer-wins, vector clocks, or application-level merge) to reconcile the divergent data.
Complexity Analysis
The CAP theorem does not prescribe algorithms with traditional time complexity, but the trade-offs have measurable costs in terms of latency, message overhead, and recovery time.
| System Type | Write Latency | Read Consistency | Partition Behavior |
|---|---|---|---|
| CP (sync replication) | $O(\text{RTT} \times \lceil N/2 \rceil)$ | Linearizable | Minority unavailable |
| AP (async replication) | $O(\text{RTT} \times 1)$ | Eventual | All nodes respond |
| Quorum (tunable) | $O(\text{RTT} \times W)$ | Strong if $W + R > N$ | Depends on quorum overlap |
For a quorum-based system with $N$ replicas, write quorum $W$, and read quorum $R$, strong consistency holds whenever $W + R > N$. This is because any read quorum and write quorum must overlap in at least one node, guaranteeing the read sees the latest write.
Common configurations for $N = 3$:
- $W = 2$, $R = 2$: strong consistency ($2 + 2 > 3$) with balanced read and write latency
- $W = 3$, $R = 1$: strong consistency with fast reads, but writes fail if any replica is down
- $W = 1$, $R = 1$: lowest latency and highest availability, but only eventual consistency ($1 + 1 \not> 3$)
The availability during a partition depends on whether the surviving nodes can still form a quorum. If $k$ nodes are unreachable, writes remain available only while $N - k \geq W$, and reads only while $N - k \geq R$.
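The overlap rule can be checked mechanically: every write quorum of size $W$ intersects every read quorum of size $R$ exactly when $W + R > N$. A small brute-force check (illustrative only):

```python
from itertools import combinations

def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """True if every write quorum of size w shares a node with every read quorum of size r."""
    nodes = range(n)
    return all(set(ws) & set(rs)
               for ws in combinations(nodes, w)
               for rs in combinations(nodes, r))

for n, w, r in [(3, 2, 2), (3, 1, 1), (5, 3, 3), (5, 2, 2)]:
    print(n, w, r, quorums_always_overlap(n, w, r), w + r > n)
# The two booleans agree in every row: overlap is guaranteed iff W + R > N.
```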
Implementation
ALGORITHM CPWriteWithQuorum(key, value, replicas, W)
INPUT: key: data key, value: data to write, replicas: list of N nodes, W: write quorum
OUTPUT: success or error
BEGIN
ackCount <- 0
timestamp <- CURRENT_TIME()
FOR EACH node IN replicas DO
response <- SEND_WRITE(node, key, value, timestamp)
IF response is successful THEN
ackCount <- ackCount + 1
END IF
END FOR
IF ackCount >= W THEN
RETURN success
ELSE
RETURN error("insufficient replicas acknowledged")
END IF
END
ALGORITHM CPReadWithQuorum(key, replicas, R)
INPUT: key: data key, replicas: list of N nodes, R: read quorum
OUTPUT: value or error
BEGIN
responses <- empty list
FOR EACH node IN replicas DO
response <- SEND_READ(node, key)
IF response is successful THEN
APPEND response TO responses
END IF
END FOR
IF LENGTH(responses) < R THEN
RETURN error("insufficient replicas responded")
END IF
// Return the value with the highest timestamp
latest <- entry IN responses WITH MAX(timestamp)
RETURN latest.value
END
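The two quorum algorithms above map almost line-for-line onto Python. In this sketch, `Replica`, its `up` flag, and the exception-based error handling are illustrative assumptions standing in for real RPCs:

```python
import time

class Replica:
    """In-memory stand-in for a remote replica; `up` simulates reachability."""
    def __init__(self):
        self.store = {}      # key -> (value, timestamp)
        self.up = True

    def write(self, key, value, ts):
        if not self.up:
            raise ConnectionError("replica unreachable")
        current = self.store.get(key)
        if current is None or ts > current[1]:   # keep only the newest version
            self.store[key] = (value, ts)

    def read(self, key):
        if not self.up:
            raise ConnectionError("replica unreachable")
        return self.store.get(key)               # (value, ts) or None

def quorum_write(replicas, key, value, w):
    ts = time.time()
    acks = 0
    for rep in replicas:
        try:
            rep.write(key, value, ts)
            acks += 1
        except ConnectionError:
            pass
    if acks < w:
        raise RuntimeError("insufficient replicas acknowledged")

def quorum_read(replicas, key, r):
    responses = []
    for rep in replicas:
        try:
            resp = rep.read(key)
            if resp is not None:
                responses.append(resp)
        except ConnectionError:
            pass
    if len(responses) < r:
        raise RuntimeError("insufficient replicas responded")
    # Return the value with the highest timestamp, as in the pseudocode.
    return max(responses, key=lambda vt: vt[1])[0]

replicas = [Replica() for _ in range(3)]
quorum_write(replicas, "balance", 100, w=2)
replicas[0].up = False                         # one replica drops out
print(quorum_read(replicas, "balance", r=2))   # 100: W + R = 4 > N = 3
```

With one of three replicas unreachable, a $W = 2$, $R = 2$ configuration still serves both reads and writes; losing a second replica would make both quorums unreachable.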
ALGORITHM APWriteAlwaysAccept(key, value, localNode, peers)
INPUT: key: data key, value: data to write, localNode: this node, peers: other replicas
OUTPUT: always returns success
BEGIN
timestamp <- CURRENT_TIME()
nodeID <- localNode.id
// Write locally first, always succeeds
STORE(localNode, key, value, timestamp, nodeID)
// Attempt async replication to peers (best effort)
FOR EACH peer IN peers DO
ASYNC_SEND_WRITE(peer, key, value, timestamp, nodeID)
// Do not wait for acknowledgment
END FOR
RETURN success
END
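The always-accept write can be sketched with background threads standing in for `ASYNC_SEND_WRITE`; the `APNode` class and its last-writer-wins `apply` rule are illustrative assumptions, not a real replication protocol:

```python
import threading
import time

class APNode:
    """Sketch of an AP replica: writes succeed locally, replication is best effort."""
    def __init__(self, node_id, peers=None):
        self.node_id = node_id
        self.peers = peers if peers is not None else []
        self.store = {}          # key -> (value, timestamp, origin node id)
        self.lock = threading.Lock()

    def apply(self, key, value, ts, origin_id):
        # Accept a version only if it is newer (last-writer-wins, ties by node ID).
        with self.lock:
            current = self.store.get(key)
            if current is None or (ts, origin_id) > (current[1], current[2]):
                self.store[key] = (value, ts, origin_id)

    def write(self, key, value):
        ts = time.time()
        self.apply(key, value, ts, self.node_id)      # local write always succeeds
        for peer in self.peers:
            # Fire-and-forget replication; a lost message only delays convergence.
            threading.Thread(target=peer.apply,
                             args=(key, value, ts, self.node_id),
                             daemon=True).start()
        return "success"

a = APNode("A")
b = APNode("B")
a.peers = [b]
a.write("cart", ["book"])     # returns immediately, no acknowledgments awaited
time.sleep(0.1)               # give best-effort replication a moment
print(b.store["cart"][0])     # ["book"] once replication has landed
```

Note that `write` never fails: even if every peer is unreachable, the local store is updated and the divergence is repaired later by conflict resolution.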
ALGORITHM LastWriterWinsResolve(conflictingVersions)
INPUT: conflictingVersions: list of (value, timestamp, nodeID) tuples
OUTPUT: resolved value
BEGIN
// Sort by timestamp descending, break ties by nodeID
SORT conflictingVersions BY timestamp DESC, nodeID DESC
RETURN conflictingVersions[0].value
END
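In Python, the same resolution rule reduces to a `max` over (timestamp, nodeID) pairs; the sample timestamps below are made up for illustration:

```python
def last_writer_wins(versions):
    """Pick the winning version: highest timestamp, ties broken by node ID.

    `versions` is a list of (value, timestamp, node_id) tuples.
    """
    winner = max(versions, key=lambda v: (v[1], v[2]))
    return winner[0]

# Two sides of a healed partition wrote different values for the same key:
conflicting = [
    ("balance = 100", 1700000010.0, "node-a"),
    ("balance = 75",  1700000012.5, "node-b"),   # later write wins
]
print(last_writer_wins(conflicting))   # balance = 75
```

Last-writer-wins is simple but silently discards the losing write, which is why systems like Dynamo instead surface conflicts (via vector clocks) for application-level merging.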
ALGORITHM PartitionDetection(node, peers, timeout)
INPUT: node: this node, peers: list of peer nodes, timeout: heartbeat timeout
OUTPUT: set of reachable and unreachable peers
BEGIN
reachable <- empty set
unreachable <- empty set
FOR EACH peer IN peers DO
response <- SEND_HEARTBEAT(peer) WITH TIMEOUT timeout
IF response received THEN
ADD peer TO reachable
ELSE
ADD peer TO unreachable
END IF
END FOR
IF LENGTH(unreachable) > 0 THEN
LOG("Potential partition detected: unreachable nodes = " + unreachable)
END IF
RETURN (reachable, unreachable)
END
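A sketch of the heartbeat loop, using `concurrent.futures` to enforce the per-peer timeout. The `heartbeat` callable is a stand-in for a real network probe; here an unreachable peer is simulated by raising `ConnectionError`:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def detect_partition(peers, heartbeat, timeout):
    """Classify peers as reachable or unreachable by probing each with a timeout."""
    reachable, unreachable = set(), set()
    with ThreadPoolExecutor() as pool:
        for peer in peers:
            future = pool.submit(heartbeat, peer)
            try:
                future.result(timeout=timeout)   # blocks up to `timeout` seconds
                reachable.add(peer)
            except (FutureTimeout, ConnectionError):
                unreachable.add(peer)
    if unreachable:
        print("Potential partition detected: unreachable nodes =", sorted(unreachable))
    return reachable, unreachable

def fake_heartbeat(peer):
    # Simulated probe: node-c is on the far side of a partition.
    if peer == "node-c":
        raise ConnectionError("no route to host")
    return "pong"

print(detect_partition(["node-b", "node-c"], fake_heartbeat, timeout=1.0))
```

One caveat worth noting: a heartbeat timeout cannot distinguish a crashed peer from a slow network, which is exactly why partition handling policy (CP vs AP) must be decided up front rather than inferred at detection time.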
Real-World Applications
- Banking and financial systems: these systems typically choose CP because showing an incorrect account balance is unacceptable; during a partition, the system returns errors rather than stale data
- Social media feeds: platforms like Twitter and Facebook favor AP because a slightly stale feed is acceptable, and the system must always respond to billions of daily requests
- Shopping carts: Amazon's Dynamo paper famously chose AP for shopping carts, reasoning that a customer should always be able to add items even during a partition, with conflicts resolved later
- DNS: the Domain Name System is an AP system; DNS servers return cached (potentially stale) records and always respond, with updates propagating eventually
- Distributed databases: systems like Google Spanner and CockroachDB choose CP, using tightly synchronized clocks (Spanner's TrueTime) or hybrid logical clocks (CockroachDB) to minimize the availability cost of strong consistency
- Configuration management: systems like ZooKeeper and etcd choose CP because configuration data must be consistent across all nodes to avoid split-brain scenarios
Key Takeaways
- The CAP theorem states that a distributed system cannot simultaneously guarantee consistency, availability, and partition tolerance; during a network partition, it must choose between C and A
- Partition tolerance is not optional in real distributed systems because networks always risk failure, so the practical choice is between CP and AP
- CP systems (HBase, ZooKeeper, MongoDB with majority concern) sacrifice availability during partitions to ensure all reads return the latest write
- AP systems (Cassandra, DynamoDB, CouchDB) sacrifice consistency during partitions to ensure every healthy node always responds
- Quorum-based systems offer tunable consistency: strong consistency when $W + R > N$, eventual consistency otherwise
- The CAP trade-off only applies during partitions; in normal operation, a well-designed system can provide both consistency and availability