Notes/Transaction Isolation and MVCC: How Postgres Reads Without Locking

Transaction Isolation and MVCC: How Postgres Reads Without Locking

How transaction isolation levels prevent anomalies, how Postgres uses Multi-Version Concurrency Control to let readers and writers coexist, and why write skew still sneaks through snapshot isolation.

2025-10-05AI-Synthesized from Personal Notes

Database InternalsTransactionsMvccIsolation Levels

Terminology

Term	Definition	Trade-off / Gotcha
Transaction	A sequence of database operations that execute as a single atomic unit: either all succeed (COMMIT) or all are rolled back (ABORT).	Long-running transactions hold resources and can block vacuum, causing table bloat.
Isolation Level	A setting that controls which anomalies a transaction can observe from concurrent transactions. SQL defines four: read uncommitted, read committed, repeatable read, serializable.	Higher isolation costs more in throughput; most applications default to read committed and accept the trade-offs.
MVCC	Multi-Version Concurrency Control. Instead of locking rows, the database keeps multiple physical versions of each row so readers see a consistent snapshot without blocking writers.	Old versions accumulate and must be cleaned up by vacuum, or storage bloats indefinitely.
Snapshot	A frozen view of which transactions were committed at the moment a transaction (or statement) began. Determines which row versions are visible.	Snapshots are cheap to create ($O(1)$ in Postgres) but long-lived snapshots prevent vacuum from reclaiming old versions.
Write Skew	An anomaly where two concurrent transactions each read overlapping data, make decisions based on it, then write to non-overlapping rows, producing a state that neither transaction would allow alone.	Snapshot isolation does not prevent write skew; you need serializable isolation or explicit application-level locks.
xmin / xmax	Postgres system columns on every heap tuple. `xmin` is the transaction ID that created the version; `xmax` is the transaction ID that deleted or updated it (0 if still live).	These hidden columns consume 8 bytes per row and are the core mechanism behind MVCC visibility checks.
Vacuum	A background process that reclaims storage occupied by dead tuple versions that are no longer visible to any active transaction.	If vacuum falls behind (due to long transactions or misconfiguration), tables grow without bound, a condition called table bloat.
Dirty Read	Reading uncommitted data from another transaction. Allowed under read uncommitted but prevented by all other isolation levels.	Postgres never allows dirty reads, even at read uncommitted (it silently upgrades to read committed).
Phantom Read	A query returns different sets of rows when executed twice in the same transaction because another transaction inserted or deleted matching rows between executions.	Repeatable read in Postgres prevents phantoms (unlike the SQL standard minimum), because it uses snapshot isolation.

What & Why

When two users update the same bank account at the same time, something has to prevent them from both reading the old balance and both subtracting from it. That "something" is transaction isolation, and the mechanism most modern databases use to enforce it without grinding to a halt is MVCC.

The naive approach is locking: before reading a row, acquire a shared lock; before writing, acquire an exclusive lock. This is correct but slow. Readers block writers, writers block readers, and under high concurrency the database spends more time managing locks than executing queries.

MVCC flips this model. Instead of locking a row to read it, the database keeps old versions around so each transaction can read from a consistent snapshot of the past. Writers create new versions rather than modifying rows in place. The result: readers never block writers, and writers never block readers. The cost is storage overhead from multiple versions and the need for a garbage collector (vacuum) to clean up versions that no active transaction can see.

This post covers the four SQL isolation levels, how Postgres implements them using snapshots and MVCC, the physical layout of versioned tuples on disk, and the write skew anomaly that catches even experienced engineers off guard.

How It Works

Transaction Lifecycle

Every transaction in an MVCC database follows a predictable lifecycle. Understanding these phases clarifies when snapshots are taken and when writes become visible to other transactions.

MVCC Tuple Versioning

In Postgres, every row on disk is a heap tuple that carries version metadata. When a row is updated, Postgres does not overwrite the old data. Instead, it inserts a new tuple version and marks the old one as expired by setting its xmax field.

The diagram below shows a single heap page containing three versions of the same logical row (account_id = 42). The oldest version was created by transaction 100 and expired by transaction 105. Transaction 105 created a new version, which was later expired by transaction 110. The current live version was created by transaction 110 and has xmax = 0 (still active).

Isolation Level Comparison

The SQL standard defines four isolation levels that progressively prevent more anomalies. Postgres implements them using snapshots rather than traditional lock-based approaches, which means some levels are stronger than the standard requires.

* Postgres silently upgrades read uncommitted to read committed, so dirty reads never actually occur.

** The SQL standard allows phantoms at repeatable read, but Postgres prevents them because it uses full transaction-level snapshots.

Postgres MVCC Read Path

When a transaction reads a heap tuple, it must determine whether that specific version is visible. The decision depends on the tuple's xmin and xmax values compared to the transaction's snapshot.

The visibility check is the heart of MVCC. For each tuple, the algorithm asks: was the creating transaction committed before my snapshot? If yes, is the row still alive (xmax not set or not committed in my snapshot)? Only tuples that pass both checks are returned to the query.

Pessimistic vs Optimistic Concurrency

Databases broadly choose between two concurrency strategies. Traditional lock-based systems use pessimistic concurrency: assume conflicts will happen and prevent them upfront. MVCC-based systems use optimistic concurrency: assume conflicts are rare and detect them after the fact.

Write Skew Anomaly

Write skew is the most subtle concurrency anomaly. It occurs under snapshot isolation when two transactions read overlapping data, make independent decisions, and write to different rows, producing a combined state that violates an application invariant.

Consider a hospital on-call system: at least one doctor must always be on call. Doctors Alice and Bob are both on call. Each checks the constraint ("are there at least 2 doctors on call?"), sees yes, and removes themselves. Both transactions commit, and now zero doctors are on call.

The fix is to use serializable isolation (which detects the rw-dependency cycle) or to add an explicit SELECT ... FOR UPDATE on the rows being checked, forcing a write-write conflict that one transaction will lose.

Throughput Under Contention

Different isolation levels impose different overhead. Under high contention (many transactions competing for the same rows), read committed performs best because each statement gets a fresh snapshot and never retries. Repeatable read may need to abort and retry when it detects a conflict. Serializable adds predicate lock tracking and has the highest abort rate.

These numbers are illustrative for a high-contention OLTP workload (many short transactions updating overlapping rows). Under low contention or read-heavy workloads, the gap narrows significantly because snapshot overhead is minimal and aborts are rare.

Complexity Analysis

MVCC shifts costs from lock acquisition to version management. The key operations and their costs:

Snapshot Creation

$T_{\text{snapshot}} = O(1)$

In Postgres, creating a snapshot copies the current transaction state (the xmin, xmax, and xip_list of active transactions) from shared memory. This is a constant-time operation proportional to the number of concurrent transactions, not the size of the database.

Version Chain Traversal

$T_{\text{read}} = O(v)$

where $v$ is the number of versions in the chain for a given row. In the worst case, a frequently updated row with long-running transactions can accumulate many versions. Each version requires a visibility check. HOT chains keep versions on the same page, avoiding extra I/O.

Vacuum Overhead

$T_{\text{vacuum}} = O(d \cdot p)$

where $d$ is the number of dead tuples and $p$ is the number of pages containing dead tuples. Vacuum must scan pages, remove dead tuples, update the free space map, and update the visibility map. Autovacuum runs continuously in the background, throttled to avoid overwhelming I/O.

Storage Amplification

$S_{\text{total}} = S_{\text{live}} \cdot (1 + \alpha)$

where $\alpha$ is the bloat factor: the ratio of dead tuple storage to live tuple storage. A well-tuned system keeps $\alpha < 0.2$ (less than 20% bloat). Without vacuum, $\alpha$ grows without bound.

Operation	Cost	Depends On
Snapshot creation	$O(1)$	Number of active transactions
Tuple visibility check	$O(1)$ per version	Snapshot comparison
Version chain traversal	$O(v)$	Update frequency, vacuum lag
Vacuum scan	$O(d \cdot p)$	Dead tuple count, page spread
Storage overhead	$O(n \cdot \alpha)$	Update rate, vacuum frequency

Implementation

MVCC Visibility Check

The core algorithm that every SELECT in Postgres executes for each candidate tuple:

FUNCTION is_visible(tuple, snapshot):
    // Step 1: Check if the creating transaction is visible
    IF tuple.xmin = snapshot.current_xid THEN
        // We created this tuple in our own transaction
        IF tuple.xmax = 0 THEN
            RETURN true   // Not deleted by us
        ELSE
            RETURN false  // We deleted it ourselves
        END IF
    END IF

    IF tuple.xmin NOT IN snapshot.committed_xids THEN
        RETURN false  // Creator not yet committed in our snapshot
    END IF

    IF tuple.xmin IN snapshot.active_xids THEN
        RETURN false  // Creator was still running when we took snapshot
    END IF

    // Step 2: Check if the row has been deleted
    IF tuple.xmax = 0 THEN
        RETURN true  // Row is live, no one has deleted it
    END IF

    IF tuple.xmax = snapshot.current_xid THEN
        RETURN false  // We deleted this row
    END IF

    IF tuple.xmax NOT IN snapshot.committed_xids THEN
        RETURN true  // Deleter has not committed, row is still visible
    END IF

    IF tuple.xmax IN snapshot.active_xids THEN
        RETURN true  // Deleter was still running, row is still visible
    END IF

    // Deleter committed before our snapshot: row is gone
    RETURN false
END FUNCTION

Snapshot Isolation Transaction Flow

The complete flow for a transaction running under repeatable read (snapshot isolation):

FUNCTION execute_transaction_snapshot_isolation(operations):
    xid = allocate_new_transaction_id()
    snapshot = capture_snapshot()  // Freeze committed state

    FOR EACH op IN operations DO
        IF op.type = READ THEN
            tuples = scan_heap(op.table, op.predicate)
            visible = []
            FOR EACH t IN tuples DO
                IF is_visible(t, snapshot) THEN
                    visible.append(t)
                END IF
            END FOR
            op.result = visible

        ELSE IF op.type = WRITE THEN
            target = find_current_version(op.row_id, snapshot)
            IF target.xmax != 0 AND target.xmax != xid THEN
                // Another transaction modified this row
                IF is_committed(target.xmax) THEN
                    ABORT(xid)  // Serialization failure, retry
                    RETURN error("concurrent update detected")
                ELSE
                    // Other transaction still running, wait
                    wait_for(target.xmax)
                    // After wait, re-check
                    IF is_committed(target.xmax) THEN
                        ABORT(xid)
                        RETURN error("concurrent update detected")
                    END IF
                END IF
            END IF

            new_tuple = create_version(op.new_data, xid)
            target.xmax = xid  // Mark old version as expired
            insert_into_heap(new_tuple)
        END IF
    END FOR

    mark_committed(xid)
    RETURN success
END FUNCTION

Real-World Applications

Postgres MVCC and transaction isolation are not academic concepts. They directly affect how production systems behave:

Banking and financial systems rely on serializable isolation for balance transfers. A transfer from account A to account B must see consistent balances for both accounts, and write skew on overlapping constraints (like overdraft limits) must be prevented.
E-commerce inventory systems typically use read committed for catalog browsing (high throughput, stale reads acceptable) but repeatable read or serializable for checkout flows where the "last item in stock" race condition matters.
Analytics and reporting queries benefit enormously from MVCC. A long-running report can read a consistent snapshot of the entire database while OLTP transactions continue writing, with zero lock contention between the two workloads.
Multi-tenant SaaS platforms use MVCC to isolate tenant queries. Each tenant's dashboard query sees a consistent snapshot, even while other tenants are actively writing data to shared tables.
Queue-based systems (using Postgres as a job queue with SELECT ... FOR UPDATE SKIP LOCKED) rely on the interaction between row-level locks and MVCC visibility to distribute work items across consumers without double-processing.

Key Takeaways

MVCC lets readers and writers coexist without blocking each other. Readers see a consistent snapshot; writers create new tuple versions instead of modifying in place.
Postgres implements four isolation levels using snapshots, not traditional locks. Read committed takes a new snapshot per statement. Repeatable read and serializable take one snapshot for the entire transaction.
Write skew is the anomaly that snapshot isolation misses. Two transactions can read overlapping data and write non-overlapping rows, violating application invariants. Only serializable isolation or explicit locking prevents it.
Vacuum is the price of MVCC. Dead tuple versions accumulate until vacuum reclaims them. Long-running transactions prevent vacuum from cleaning up, causing table bloat.
The visibility check (xmin/xmax against the snapshot) runs for every tuple on every read. It is fast ($O(1)$ per tuple) but the version chain length matters: frequently updated rows with vacuum lag create longer chains.

2025-10-15

The Physical I/O Path: How Bytes Hit the Disk

fsync semantics, O_DIRECT vs buffered I/O, the OS page cache, SSD vs HDD access patterns, torn write protection, and the fsync-gate incident that shook PostgreSQL.

2025-10-01

SQL vs NoSQL vs NewSQL: Choosing the Right Database Paradigm

Understanding the relational model, document/key-value/graph/column-family stores, ACID vs BASE guarantees, and how to pick the right database for your workload.

2025-10-02

TEST2:B-Tree Indexes Deep Dive: Why Your WHERE Clause Is Fast

How B-trees store sorted data on disk, how page splits work, what fill factor controls, and the difference between clustered and non-clustered indexes.