Transaction Isolation and MVCC: How Postgres Reads Without Locking
How transaction isolation levels prevent anomalies, how Postgres uses Multi-Version Concurrency Control to let readers and writers coexist, and why write skew still sneaks through snapshot isolation.
Terminology
| Term | Definition | Trade-off / Gotcha |
|---|---|---|
| Transaction | A sequence of database operations that execute as a single atomic unit: either all succeed (COMMIT) or all are rolled back (ABORT). | Long-running transactions hold resources and can block vacuum, causing table bloat. |
| Isolation Level | A setting that controls which anomalies a transaction can observe from concurrent transactions. SQL defines four: read uncommitted, read committed, repeatable read, serializable. | Higher isolation costs more in throughput; most applications default to read committed and accept the trade-offs. |
| MVCC | Multi-Version Concurrency Control. Instead of locking rows, the database keeps multiple physical versions of each row so readers see a consistent snapshot without blocking writers. | Old versions accumulate and must be cleaned up by vacuum, or storage bloats indefinitely. |
| Snapshot | A frozen view of which transactions were committed at the moment a transaction (or statement) began. Determines which row versions are visible. | Snapshots are cheap to create ($O(1)$ in Postgres) but long-lived snapshots prevent vacuum from reclaiming old versions. |
| Write Skew | An anomaly where two concurrent transactions each read overlapping data, make decisions based on it, then write to non-overlapping rows, producing a state that neither transaction would allow alone. | Snapshot isolation does not prevent write skew; you need serializable isolation or explicit application-level locks. |
| xmin / xmax | Postgres system columns on every heap tuple. xmin is the transaction ID that created the version; xmax is the transaction ID that deleted or updated it (0 if still live). |
These hidden columns consume 8 bytes per row and are the core mechanism behind MVCC visibility checks. |
| Vacuum | A background process that reclaims storage occupied by dead tuple versions that are no longer visible to any active transaction. | If vacuum falls behind (due to long transactions or misconfiguration), tables grow without bound, a condition called table bloat. |
| Dirty Read | Reading uncommitted data from another transaction. Allowed under read uncommitted but prevented by all other isolation levels. | Postgres never allows dirty reads, even at read uncommitted (it silently upgrades to read committed). |
| Phantom Read | A query returns different sets of rows when executed twice in the same transaction because another transaction inserted or deleted matching rows between executions. | Repeatable read in Postgres prevents phantoms (unlike the SQL standard minimum), because it uses snapshot isolation. |
What & Why
When two users update the same bank account at the same time, something has to prevent them from both reading the old balance and both subtracting from it. That "something" is transaction isolation, and the mechanism most modern databases use to enforce it without grinding to a halt is MVCC.
The naive approach is locking: before reading a row, acquire a shared lock; before writing, acquire an exclusive lock. This is correct but slow. Readers block writers, writers block readers, and under high concurrency the database spends more time managing locks than executing queries.
MVCC flips this model. Instead of locking a row to read it, the database keeps old versions around so each transaction can read from a consistent snapshot of the past. Writers create new versions rather than modifying rows in place. The result: readers never block writers, and writers never block readers. The cost is storage overhead from multiple versions and the need for a garbage collector (vacuum) to clean up versions that no active transaction can see.
This post covers the four SQL isolation levels, how Postgres implements them using snapshots and MVCC, the physical layout of versioned tuples on disk, and the write skew anomaly that catches even experienced engineers off guard.
How It Works
Transaction Lifecycle
Every transaction in an MVCC database follows a predictable lifecycle. Understanding these phases clarifies when snapshots are taken and when writes become visible to other transactions.
MVCC Tuple Versioning
In Postgres, every row on disk is a heap tuple that carries version metadata. When a row is updated, Postgres does not overwrite the old data. Instead, it inserts a new tuple version and marks the old one as expired by setting its xmax field.
The diagram below shows a single heap page containing three versions of the same logical row (account_id = 42). The oldest version was created by transaction 100 and expired by transaction 105. Transaction 105 created a new version, which was later expired by transaction 110. The current live version was created by transaction 110 and has xmax = 0 (still active).
Isolation Level Comparison
The SQL standard defines four isolation levels that progressively prevent more anomalies. Postgres implements them using snapshots rather than traditional lock-based approaches, which means some levels are stronger than the standard requires.
* Postgres silently upgrades read uncommitted to read committed, so dirty reads never actually occur.
** The SQL standard allows phantoms at repeatable read, but Postgres prevents them because it uses full transaction-level snapshots.
Postgres MVCC Read Path
When a transaction reads a heap tuple, it must determine whether that specific version is visible. The decision depends on the tuple's xmin and xmax values compared to the transaction's snapshot.
The visibility check is the heart of MVCC. For each tuple, the algorithm asks: was the creating transaction committed before my snapshot? If yes, is the row still alive (xmax not set or not committed in my snapshot)? Only tuples that pass both checks are returned to the query.
Pessimistic vs Optimistic Concurrency
Databases broadly choose between two concurrency strategies. Traditional lock-based systems use pessimistic concurrency: assume conflicts will happen and prevent them upfront. MVCC-based systems use optimistic concurrency: assume conflicts are rare and detect them after the fact.
Write Skew Anomaly
Write skew is the most subtle concurrency anomaly. It occurs under snapshot isolation when two transactions read overlapping data, make independent decisions, and write to different rows, producing a combined state that violates an application invariant.
Consider a hospital on-call system: at least one doctor must always be on call. Doctors Alice and Bob are both on call. Each checks the constraint ("are there at least 2 doctors on call?"), sees yes, and removes themselves. Both transactions commit, and now zero doctors are on call.
The fix is to use serializable isolation (which detects the rw-dependency cycle) or to add an explicit SELECT ... FOR UPDATE on the rows being checked, forcing a write-write conflict that one transaction will lose.
Throughput Under Contention
Different isolation levels impose different overhead. Under high contention (many transactions competing for the same rows), read committed performs best because each statement gets a fresh snapshot and never retries. Repeatable read may need to abort and retry when it detects a conflict. Serializable adds predicate lock tracking and has the highest abort rate.
These numbers are illustrative for a high-contention OLTP workload (many short transactions updating overlapping rows). Under low contention or read-heavy workloads, the gap narrows significantly because snapshot overhead is minimal and aborts are rare.
Complexity Analysis
MVCC shifts costs from lock acquisition to version management. The key operations and their costs:
Snapshot Creation
$T_{\text{snapshot}} = O(1)$
In Postgres, creating a snapshot copies the current transaction state (the xmin, xmax, and xip_list of active transactions) from shared memory. This is a constant-time operation proportional to the number of concurrent transactions, not the size of the database.
Version Chain Traversal
$T_{\text{read}} = O(v)$
where $v$ is the number of versions in the chain for a given row. In the worst case, a frequently updated row with long-running transactions can accumulate many versions. Each version requires a visibility check. HOT chains keep versions on the same page, avoiding extra I/O.
Vacuum Overhead
$T_{\text{vacuum}} = O(d \cdot p)$
where $d$ is the number of dead tuples and $p$ is the number of pages containing dead tuples. Vacuum must scan pages, remove dead tuples, update the free space map, and update the visibility map. Autovacuum runs continuously in the background, throttled to avoid overwhelming I/O.
Storage Amplification
$S_{\text{total}} = S_{\text{live}} \cdot (1 + \alpha)$
where $\alpha$ is the bloat factor: the ratio of dead tuple storage to live tuple storage. A well-tuned system keeps $\alpha < 0.2$ (less than 20% bloat). Without vacuum, $\alpha$ grows without bound.
| Operation | Cost | Depends On |
|---|---|---|
| Snapshot creation | $O(1)$ | Number of active transactions |
| Tuple visibility check | $O(1)$ per version | Snapshot comparison |
| Version chain traversal | $O(v)$ | Update frequency, vacuum lag |
| Vacuum scan | $O(d \cdot p)$ | Dead tuple count, page spread |
| Storage overhead | $O(n \cdot \alpha)$ | Update rate, vacuum frequency |
Implementation
MVCC Visibility Check
The core algorithm that every SELECT in Postgres executes for each candidate tuple:
FUNCTION is_visible(tuple, snapshot):
// Step 1: Check if the creating transaction is visible
IF tuple.xmin = snapshot.current_xid THEN
// We created this tuple in our own transaction
IF tuple.xmax = 0 THEN
RETURN true // Not deleted by us
ELSE
RETURN false // We deleted it ourselves
END IF
END IF
IF tuple.xmin NOT IN snapshot.committed_xids THEN
RETURN false // Creator not yet committed in our snapshot
END IF
IF tuple.xmin IN snapshot.active_xids THEN
RETURN false // Creator was still running when we took snapshot
END IF
// Step 2: Check if the row has been deleted
IF tuple.xmax = 0 THEN
RETURN true // Row is live, no one has deleted it
END IF
IF tuple.xmax = snapshot.current_xid THEN
RETURN false // We deleted this row
END IF
IF tuple.xmax NOT IN snapshot.committed_xids THEN
RETURN true // Deleter has not committed, row is still visible
END IF
IF tuple.xmax IN snapshot.active_xids THEN
RETURN true // Deleter was still running, row is still visible
END IF
// Deleter committed before our snapshot: row is gone
RETURN false
END FUNCTION
Snapshot Isolation Transaction Flow
The complete flow for a transaction running under repeatable read (snapshot isolation):
FUNCTION execute_transaction_snapshot_isolation(operations):
xid = allocate_new_transaction_id()
snapshot = capture_snapshot() // Freeze committed state
FOR EACH op IN operations DO
IF op.type = READ THEN
tuples = scan_heap(op.table, op.predicate)
visible = []
FOR EACH t IN tuples DO
IF is_visible(t, snapshot) THEN
visible.append(t)
END IF
END FOR
op.result = visible
ELSE IF op.type = WRITE THEN
target = find_current_version(op.row_id, snapshot)
IF target.xmax != 0 AND target.xmax != xid THEN
// Another transaction modified this row
IF is_committed(target.xmax) THEN
ABORT(xid) // Serialization failure, retry
RETURN error("concurrent update detected")
ELSE
// Other transaction still running, wait
wait_for(target.xmax)
// After wait, re-check
IF is_committed(target.xmax) THEN
ABORT(xid)
RETURN error("concurrent update detected")
END IF
END IF
END IF
new_tuple = create_version(op.new_data, xid)
target.xmax = xid // Mark old version as expired
insert_into_heap(new_tuple)
END IF
END FOR
mark_committed(xid)
RETURN success
END FUNCTION
Real-World Applications
Postgres MVCC and transaction isolation are not academic concepts. They directly affect how production systems behave:
-
Banking and financial systems rely on serializable isolation for balance transfers. A transfer from account A to account B must see consistent balances for both accounts, and write skew on overlapping constraints (like overdraft limits) must be prevented.
-
E-commerce inventory systems typically use read committed for catalog browsing (high throughput, stale reads acceptable) but repeatable read or serializable for checkout flows where the "last item in stock" race condition matters.
-
Analytics and reporting queries benefit enormously from MVCC. A long-running report can read a consistent snapshot of the entire database while OLTP transactions continue writing, with zero lock contention between the two workloads.
-
Multi-tenant SaaS platforms use MVCC to isolate tenant queries. Each tenant's dashboard query sees a consistent snapshot, even while other tenants are actively writing data to shared tables.
-
Queue-based systems (using Postgres as a job queue with
SELECT ... FOR UPDATE SKIP LOCKED) rely on the interaction between row-level locks and MVCC visibility to distribute work items across consumers without double-processing.
Key Takeaways
-
MVCC lets readers and writers coexist without blocking each other. Readers see a consistent snapshot; writers create new tuple versions instead of modifying in place.
-
Postgres implements four isolation levels using snapshots, not traditional locks. Read committed takes a new snapshot per statement. Repeatable read and serializable take one snapshot for the entire transaction.
-
Write skew is the anomaly that snapshot isolation misses. Two transactions can read overlapping data and write non-overlapping rows, violating application invariants. Only serializable isolation or explicit locking prevents it.
-
Vacuum is the price of MVCC. Dead tuple versions accumulate until vacuum reclaims them. Long-running transactions prevent vacuum from cleaning up, causing table bloat.
-
The visibility check (xmin/xmax against the snapshot) runs for every tuple on every read. It is fast ($O(1)$ per tuple) but the version chain length matters: frequently updated rows with vacuum lag create longer chains.
Read More
2025-10-15
The Physical I/O Path: How Bytes Hit the Disk
fsync semantics, O_DIRECT vs buffered I/O, the OS page cache, SSD vs HDD access patterns, torn write protection, and the fsync-gate incident that shook PostgreSQL.
2025-10-01
SQL vs NoSQL vs NewSQL: Choosing the Right Database Paradigm
Understanding the relational model, document/key-value/graph/column-family stores, ACID vs BASE guarantees, and how to pick the right database for your workload.
2025-10-02
TEST2:B-Tree Indexes Deep Dive: Why Your WHERE Clause Is Fast
How B-trees store sorted data on disk, how page splits work, what fill factor controls, and the difference between clustered and non-clustered indexes.