Vector Clocks and Conflict Resolution: Tracking Causality in Distributed Systems
How vector clocks, version vectors, and CRDTs track causality and resolve conflicts in distributed systems where there is no global clock.
Terminology
- Logical clock: a mechanism for ordering events in a distributed system without relying on synchronized physical clocks; assigns a number to each event that respects causality
- Lamport timestamp: a single integer counter that increments on each event; if event $A$ happens before event $B$, then $L(A) < L(B)$, but the converse is not necessarily true
- Vector clock: an array of $n$ counters (one per node) that precisely captures the happens-before relationship; if $VC(A) < VC(B)$ component-wise, then $A$ causally precedes $B$
- Happens-before ($\rightarrow$): a partial order on events; $A \rightarrow B$ if $A$ and $B$ are on the same node and $A$ occurred first, or if $A$ is a message send and $B$ is the corresponding receive
- Concurrent events: two events $A$ and $B$ are concurrent ($A \| B$) if neither $A \rightarrow B$ nor $B \rightarrow A$; the system cannot determine which happened first
- Version vector: similar to a vector clock but tracks versions of a data item rather than events; each entry records the latest version number from each node that has written to the item
- Conflict: a situation where two or more concurrent writes modify the same data item, producing divergent versions that must be reconciled
- Last-writer-wins (LWW): a conflict resolution strategy that keeps the write with the highest timestamp and discards others; simple but can lose data
- CRDT (Conflict-free Replicated Data Type): a data structure designed so that concurrent updates can always be merged automatically without conflicts, guaranteeing eventual convergence
- State-based CRDT (CvRDT): a CRDT where replicas periodically exchange their full state and merge using a join operation on a semilattice
- Operation-based CRDT (CmRDT): a CRDT where replicas exchange operations (not state) and apply them; operations must be commutative
- Semilattice: a mathematical structure with a join (least upper bound) operation that is commutative, associative, and idempotent; the foundation of state-based CRDTs
- Sibling: in systems like Riak, when concurrent writes create conflicting versions, all versions are kept as siblings until the application or an automatic resolver picks one
- Dotted version vector: an optimized version vector that uses "dots" (node, counter pairs) to precisely track individual write events, reducing false conflicts
What & Why
In a single-node system, ordering events is trivial: the CPU has one clock, and operations execute in sequence. In a distributed system, there is no global clock. Each node has its own clock, and those clocks drift. You cannot reliably determine which of two events on different nodes happened first by comparing wall-clock timestamps.
This matters because many operations depend on ordering. If two clients update the same record, you need to know which update came last (or whether they were concurrent and need merging). If a message is sent before another, the receiver should process them in the correct order.
Lamport timestamps provide a partial solution: a single counter that respects causality. But they cannot distinguish between causally related and concurrent events. If $L(A) < L(B)$, you know $A$ might have happened before $B$, but you cannot be sure.
Vector clocks solve this completely. By maintaining a counter per node, they capture the full causal history of each event. Two events can be compared precisely: one happened before the other, or they are concurrent. This enables accurate conflict detection in replicated systems.
Once conflicts are detected, they must be resolved. Last-writer-wins is simple but lossy. Application-level merge functions are flexible but complex. CRDTs are data structures that sidestep the problem entirely by ensuring that concurrent operations can always be merged automatically, with mathematical guarantees of convergence.
How It Works
Lamport Timestamps
Each node maintains a counter $C$. On every local event, increment $C$. When sending a message, attach $C$. When receiving a message with timestamp $T$, set $C = \max(C, T) + 1$.
This guarantees: if $A \rightarrow B$, then $L(A) < L(B)$. But the converse does not hold: two concurrent events may have $L(A) < L(B)$ without $A$ happening before $B$.
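The update rules above can be sketched as a small Python class (an illustrative sketch, not the pseudocode given later in this document). The final assertion demonstrates the converse failing: two events on nodes that never communicated are causally concurrent, yet their timestamps still impose an arbitrary order.

```python
# Minimal Lamport clock sketch.
class LamportClock:
    def __init__(self):
        self.counter = 0

    def local_event(self):
        self.counter += 1
        return self.counter

    def send(self):
        # Increment, then attach the counter to the outgoing message.
        self.counter += 1
        return self.counter

    def receive(self, sender_ts):
        # C = max(C, T) + 1
        self.counter = max(self.counter, sender_ts) + 1
        return self.counter

# Two nodes that never exchange messages: their events are concurrent,
# yet the timestamps still order them.
a, b = LamportClock(), LamportClock()
ta = a.local_event()   # 1
b.local_event()        # 1
tb = b.local_event()   # 2
assert ta < tb         # ordered by timestamp, but causally concurrent
```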
Vector Clocks
Each node $i$ maintains a vector $VC[0..n-1]$ where $VC[j]$ represents the latest event count node $i$ knows about from node $j$.
Rules:
- On a local event at node $i$: increment $VC[i]$
- When sending a message from node $i$: increment $VC[i]$, attach $VC$ to the message
- When receiving a message with vector $VC_{msg}$ at node $i$: for each $j$, set $VC[j] = \max(VC[j], VC_{msg}[j])$, then increment $VC[i]$
Comparison:
- $VC(A) \leq VC(B)$ if every component of $A$ is $\leq$ the corresponding component of $B$
- $A \rightarrow B$ if and only if $VC(A) < VC(B)$, i.e. $VC(A) \leq VC(B)$ and at least one component is strictly less
- $A \| B$ (concurrent) if neither $VC(A) \leq VC(B)$ nor $VC(B) \leq VC(A)$
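These comparison rules translate directly into a few lines of Python (a sketch; vector clocks are represented as plain lists indexed by node):

```python
# Compare two vector clocks per the rules above.
def compare(vc_a, vc_b):
    """Return 'BEFORE', 'AFTER', 'CONCURRENT', or 'EQUAL'."""
    a_le = all(x <= y for x, y in zip(vc_a, vc_b))  # VC(A) <= VC(B)?
    b_le = all(y <= x for x, y in zip(vc_a, vc_b))  # VC(B) <= VC(A)?
    if a_le and b_le:
        return "EQUAL"
    if a_le:
        return "BEFORE"       # A -> B
    if b_le:
        return "AFTER"        # B -> A
    return "CONCURRENT"       # A || B

assert compare([1, 0, 0], [2, 1, 0]) == "BEFORE"
assert compare([2, 1, 0], [1, 2, 0]) == "CONCURRENT"
```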
Conflict Detection with Version Vectors
When a client writes to a replica, the replica increments its entry in the version vector and stores the new value. When replicas sync, they compare version vectors:
- If one vector dominates the other (every component is $\geq$ the corresponding component), the dominated version is older and can be discarded.
- If neither dominates, the writes are concurrent and a conflict exists. The system must either keep both versions (siblings) or merge them.
Conflict Resolution Strategies
Last-Writer-Wins (LWW): Attach a wall-clock timestamp to each write. When conflicts arise, keep the write with the highest timestamp. This is simple and always converges, but it silently discards concurrent writes. Cassandra uses LWW by default.
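A one-function sketch makes the lossiness concrete. The tuples and tie-breaking rule here are illustrative, not any particular database's wire format; note how the concurrent write from node "a" is silently discarded.

```python
# LWW sketch: keep the write with the highest (timestamp, node_id) pair.
def lww_resolve(versions):
    # versions: list of (value, timestamp, node_id) tuples.
    # Ties on timestamp are broken deterministically by node_id.
    return max(versions, key=lambda v: (v[1], v[2]))[0]

# Two writes with the same wall-clock timestamp: one simply wins.
concurrent = [("cart=[book]", 100, "a"), ("cart=[pen]", 100, "b")]
assert lww_resolve(concurrent) == "cart=[pen]"   # the book write is lost
```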
Application-level merge: The system presents all conflicting versions to the application, which decides how to merge them. Amazon's shopping cart (Dynamo) used this approach: the application takes the union of items from all conflicting cart versions.
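In the style of Dynamo's shopping cart, the application-level merge can be as simple as a set union over all conflicting siblings (a sketch; carts are modeled as plain sets of item names):

```python
# Application-level merge: union the items from all conflicting cart versions.
def merge_carts(siblings):
    merged = set()
    for cart in siblings:
        merged |= cart   # set union
    return merged

assert merge_carts([{"book", "pen"}, {"book", "mug"}]) == {"book", "pen", "mug"}
```

One drawback the Dynamo paper itself noted: a pure union can resurrect items that one sibling had deliberately removed.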
CRDTs: Data structures where the merge operation is built into the type itself. A G-Counter (grow-only counter) maintains a counter per node; the merged value is the component-wise maximum, and the total is the sum. A G-Set (grow-only set) merges via set union. More complex CRDTs like OR-Set (observed-remove set) handle both additions and removals.
CRDT Example: G-Counter
Each node $i$ maintains its own counter $c_i$. To increment, node $i$ increments $c_i$. The total count is $\sum_{i} c_i$. To merge two replicas, take the component-wise maximum: $\text{merged}[i] = \max(A[i], B[i])$.
This merge is commutative ($A \sqcup B = B \sqcup A$), associative ($(A \sqcup B) \sqcup C = A \sqcup (B \sqcup C)$), and idempotent ($A \sqcup A = A$). These three properties guarantee convergence regardless of message ordering or duplication.
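A compact Python sketch of the G-Counter shows the merge properties in action (illustrative; node indices are assigned statically):

```python
# G-Counter sketch: one slot per node; merge = component-wise maximum.
class GCounter:
    def __init__(self, node_index, num_nodes):
        self.i = node_index
        self.counts = [0] * num_nodes

    def increment(self):
        self.counts[self.i] += 1

    def value(self):
        return sum(self.counts)

    def merge(self, other):
        self.counts = [max(x, y) for x, y in zip(self.counts, other.counts)]

# Two replicas increment independently, then merge in either order.
a, b = GCounter(0, 2), GCounter(1, 2)
a.increment(); a.increment()    # a.counts == [2, 0]
b.increment()                   # b.counts == [0, 1]
a.merge(b); b.merge(a)
assert a.counts == b.counts == [2, 1]
assert a.value() == 3
a.merge(b)                      # idempotent: merging again changes nothing
assert a.value() == 3
```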
Complexity Analysis
| Mechanism | Space per Event | Comparison | Detects Concurrency? |
|---|---|---|---|
| Lamport timestamp | $O(1)$ | $O(1)$ | No |
| Vector clock | $O(n)$ | $O(n)$ | Yes |
| Dotted version vector | $O(n)$ worst case | $O(n)$ | Yes (fewer false conflicts) |
| G-Counter CRDT | $O(n)$ | $O(n)$ merge | N/A (conflict-free) |
For a system with $n$ nodes, the vector clock size grows linearly with the number of nodes. With 64-bit counters and 100 nodes, each vector clock is 800 bytes. This is attached to every message and stored with every data version, so it becomes a significant overhead at scale.
Comparing two vector clocks requires checking all $n$ components, so each comparison costs $O(n)$.
For CRDTs, the merge cost depends on the data structure. Merging two G-Counters over $n$ nodes is a component-wise maximum, i.e. $O(n)$.
For an OR-Set with $m$ elements, merging must reconcile every element and its tags, so the cost grows with $m$, plus per-element tag metadata that depends on how many of the $n$ nodes have written.
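To make the OR-Set's per-element tag metadata concrete, here is a deliberately simplified sketch. Each add generates a unique tag; a remove drops only the tags it has observed, which is what lets a concurrent re-add survive ("add wins"). The merge here is tag union only; a full OR-Set also needs tombstones or a causal context so that removes propagate between replicas.

```python
import uuid

# Simplified OR-Set sketch: element -> set of unique add-tags.
class ORSet:
    def __init__(self):
        self.entries = {}

    def add(self, element):
        # Every add gets a fresh, globally unique tag.
        self.entries.setdefault(element, set()).add(uuid.uuid4().hex)

    def remove(self, element):
        # Remove only the tags observed locally.
        self.entries.pop(element, None)

    def contains(self, element):
        return bool(self.entries.get(element))

    def merge(self, other):
        # Simplified: union of tags. A complete OR-Set also tracks
        # tombstones (or a causal context) so removes propagate.
        for elem, tags in other.entries.items():
            self.entries.setdefault(elem, set()).update(tags)

a, b = ORSet(), ORSet()
a.add("x")
b.merge(a)           # b observes a's tag for "x"
b.remove("x")        # b removes the tag it observed
a.add("x")           # concurrent re-add with a fresh tag
a.merge(b)
assert a.contains("x")   # the concurrent add wins over the remove
```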
Implementation
ALGORITHM LamportTimestamp(node)
STATE: counter: integer, initialized to 0
OPERATION LocalEvent()
BEGIN
counter <- counter + 1
RETURN counter
END
OPERATION SendMessage(destination)
BEGIN
counter <- counter + 1
SEND(destination, message, counter)
END
OPERATION ReceiveMessage(senderTimestamp)
BEGIN
counter <- MAX(counter, senderTimestamp) + 1
RETURN counter
END
ALGORITHM VectorClockUpdate(nodeIndex, vectorClock, numNodes)
STATE: vectorClock: array of integers, size numNodes, initialized to all zeros
OPERATION LocalEvent()
BEGIN
vectorClock[nodeIndex] <- vectorClock[nodeIndex] + 1
RETURN COPY(vectorClock)
END
OPERATION SendMessage(destination)
BEGIN
vectorClock[nodeIndex] <- vectorClock[nodeIndex] + 1
SEND(destination, message, COPY(vectorClock))
END
OPERATION ReceiveMessage(senderVC)
BEGIN
FOR i <- 0 TO numNodes - 1 DO
vectorClock[i] <- MAX(vectorClock[i], senderVC[i])
END FOR
vectorClock[nodeIndex] <- vectorClock[nodeIndex] + 1
RETURN COPY(vectorClock)
END
ALGORITHM CompareVectorClocks(vcA, vcB, n)
INPUT: vcA, vcB: vector clocks of size n
OUTPUT: BEFORE, AFTER, CONCURRENT, or EQUAL
BEGIN
aLessOrEqual <- TRUE
bLessOrEqual <- TRUE
equal <- TRUE
FOR i <- 0 TO n - 1 DO
IF vcA[i] < vcB[i] THEN
bLessOrEqual <- FALSE
equal <- FALSE
ELSE IF vcA[i] > vcB[i] THEN
aLessOrEqual <- FALSE
equal <- FALSE
END IF
END FOR
IF equal THEN
RETURN EQUAL
ELSE IF aLessOrEqual THEN
RETURN BEFORE // A happened before B
ELSE IF bLessOrEqual THEN
RETURN AFTER // A happened after B
ELSE
RETURN CONCURRENT
END IF
END
ALGORITHM GCounterCRDT(nodeIndex, numNodes)
STATE: counts: array of integers, size numNodes, initialized to all zeros
OPERATION Increment()
BEGIN
counts[nodeIndex] <- counts[nodeIndex] + 1
END
OPERATION Value()
BEGIN
total <- 0
FOR i <- 0 TO numNodes - 1 DO
total <- total + counts[i]
END FOR
RETURN total
END
OPERATION Merge(otherCounts)
BEGIN
FOR i <- 0 TO numNodes - 1 DO
counts[i] <- MAX(counts[i], otherCounts[i])
END FOR
END
ALGORITHM ConflictResolutionLWW(versions)
INPUT: versions: list of (value, timestamp, nodeID) tuples
OUTPUT: winning value
BEGIN
best <- versions[0]
FOR i <- 1 TO LENGTH(versions) - 1 DO
IF versions[i].timestamp > best.timestamp THEN
best <- versions[i]
ELSE IF versions[i].timestamp = best.timestamp THEN
// Break ties deterministically by nodeID
IF versions[i].nodeID > best.nodeID THEN
best <- versions[i]
END IF
END IF
END FOR
RETURN best.value
END
Real-World Applications
- Amazon Dynamo / DynamoDB: the original Dynamo paper described vector clocks for detecting conflicting writes across replicas, with all conflicting versions returned to the application for resolution; DynamoDB, its successor, later moved to server-side timestamps
- Riak: uses dotted version vectors to track causality and detect concurrent writes; conflicting versions are stored as siblings and can be resolved by the application or by built-in resolvers
- Redis CRDTs: Redis Enterprise supports CRDT-based data types (counters, sets, registers) for active-active geo-replication, allowing writes at any datacenter with automatic conflict-free merging
- Figma: uses CRDT-inspired techniques for real-time collaborative design editing, allowing multiple users to edit the same document simultaneously with automatic conflict resolution
- Apple Notes and iCloud: use operational transformation and CRDT-like techniques to sync notes across devices, handling concurrent edits made while devices are offline
- Cassandra: uses last-writer-wins with wall-clock timestamps as the default conflict resolution strategy; this is simple and performant but can lose concurrent writes
Key Takeaways
- Distributed systems cannot rely on physical clocks for ordering; logical clocks (Lamport timestamps, vector clocks) provide ordering guarantees based on causality
- Lamport timestamps are compact ($O(1)$) but cannot detect concurrent events; vector clocks are larger ($O(n)$) but precisely capture the happens-before relationship
- Two events are concurrent if neither vector clock dominates the other; concurrent writes to the same data create conflicts that must be resolved
- Last-writer-wins is the simplest resolution strategy but silently discards data; application-level merge gives full control but adds complexity
- CRDTs guarantee automatic convergence by designing data structures whose merge operations are commutative, associative, and idempotent
- The choice between vector clocks, LWW, and CRDTs depends on the application: LWW for simplicity, vector clocks for accurate conflict detection, CRDTs for conflict-free collaboration