The Physical I/O Path: How Bytes Hit the Disk
fsync semantics, O_DIRECT vs buffered I/O, the OS page cache, SSD vs HDD access patterns, torn write protection, and the fsync-gate incident that shook PostgreSQL.
Terminology
| Term | Definition | Trade-off / Gotcha |
|---|---|---|
| fsync | A system call that forces the OS to flush dirty pages from the page cache to stable storage, returning only after the device confirms the write is durable. | Calling fsync after every write is safe but slow; batching writes before a single fsync improves throughput but increases the window of data loss on crash. |
| O_DIRECT | A file open flag that bypasses the OS page cache entirely, sending reads and writes directly between user-space buffers and the storage device. | Requires page-aligned buffers (typically 4 KB aligned) and page-sized I/O; misaligned writes silently fall back to buffered I/O on some kernels. |
| Page Cache | A region of kernel memory that caches recently read or written disk pages, allowing the OS to coalesce writes and serve repeated reads from RAM. | Databases that manage their own buffer pool (like PostgreSQL) end up caching data twice: once in the buffer pool and once in the page cache. |
| Torn Write | A partial page write where a power failure interrupts a multi-block write, leaving the on-disk page half-old and half-new. | A 16 KB database page written as four 4 KB filesystem blocks can be torn at any block boundary; checksums detect it but cannot fix it without a WAL replay. |
| Write Amplification | The ratio of actual bytes written to storage versus the logical bytes the application intended to write, caused by journaling, double-write buffers, and SSD internal operations. | A single 8 KB page update can produce 3-5x write amplification when you account for the WAL entry, filesystem journal, and SSD garbage collection. |
| NVMe | Non-Volatile Memory Express, a storage protocol designed for SSDs that communicates over PCIe with up to 65,535 I/O queues, each holding 65,536 commands. | NVMe latency advantages only materialize under concurrent workloads; a single-threaded sequential scan may see little difference from SATA SSD. |
| Queue Depth | The number of I/O requests the host submits to the storage device before waiting for completions, allowing the device to reorder and parallelize internally. | SATA supports a single queue of 32 commands; NVMe supports thousands of queues, which is why NVMe scales better under concurrent database workloads. |
| Double-Write Buffer | A technique (used by InnoDB) where pages are first written sequentially to a dedicated buffer area on disk, then written to their final location, protecting against torn writes. | Adds one extra write per page (2x write amplification for data pages) but eliminates the need for full-page WAL images that PostgreSQL uses instead. |
| fdatasync | A variant of fsync that flushes only the file data to disk, skipping metadata (file size, timestamps) unless the metadata is needed to locate the data. | Faster than fsync when the file size has not changed (common for pre-allocated database files), but some filesystems treat it identically to fsync. |
| Write Barrier | A storage command that ensures all writes issued before the barrier are persisted to stable storage before any writes issued after the barrier. | Disabling write barriers (common in older Linux configs) improves throughput but risks silent data loss on power failure, corrupting both filesystem and database. |
What & Why
Every database makes the same fundamental promise: once a transaction commits, the data survives a crash. Delivering on that promise requires understanding exactly what happens between a write() system call in your database process and the moment electrons rearrange bits on a NAND chip or a magnetic head magnetizes a platter.
This path is surprisingly long and full of places where data can be lost. The OS page cache absorbs writes into volatile RAM. The filesystem journal adds its own writes. The block layer reorders I/O requests. The disk controller has a volatile write cache. The physical media has its own write granularity. At every layer, there is a trade-off between performance and durability.
Databases solve this with two key system calls: fsync (force data to stable storage) and O_DIRECT (bypass the OS page cache). But these tools are more subtle than they appear. PostgreSQL discovered in 2018 that fsync could silently fail on Linux, potentially corrupting data. MySQL and PostgreSQL take opposite approaches to the page cache, each with real engineering trade-offs. And the shift from spinning disks to NVMe SSDs has changed which I/O patterns matter.
Understanding the physical I/O path is not academic. It determines whether your database actually keeps its durability promise, how fast it can commit transactions, and why storage engine design choices differ so dramatically between systems.
How It Works
The Full Write Path
When a database calls write() followed by fsync(), the data passes through seven distinct layers before reaching stable storage. Each layer adds latency, and each is a potential point of failure.
The red "volatile!" label on the controller write cache is critical. Many storage devices report a write as complete when data reaches the volatile write cache, not when it hits stable media. The fsync call is supposed to force a cache flush (FUA or FLUSH CACHE command), but this depends on the device firmware honoring the command. Enterprise SSDs with power-loss capacitors can safely acknowledge writes from the volatile cache because the capacitor provides enough energy to flush to NAND on power loss.
O_DIRECT vs Buffered I/O
The OS page cache is a double-edged sword for databases. It speeds up reads by caching hot pages in kernel memory, but it also means the database's own buffer pool and the kernel are caching the same data twice. This "double buffering" wastes RAM and makes memory management unpredictable.
The choice between these approaches is a genuine engineering trade-off. PostgreSQL chose buffered I/O because it simplifies the codebase and lets the OS handle read-ahead for sequential scans. InnoDB chose O_DIRECT because it eliminates double caching and gives the storage engine full control over which pages stay in memory. RocksDB and ScyllaDB also use O_DIRECT for the same reason.
The double-caching problem is real. On a PostgreSQL server with 32 GB of RAM and shared_buffers set to 8 GB, the OS page cache might hold another 20 GB of the same database pages. That is 20 GB of RAM doing redundant work. InnoDB with O_DIRECT avoids this entirely: its buffer pool is the single source of truth for cached pages.
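One partial mitigation available to a buffered-I/O engine is to tell the kernel when its cached copy of a range is no longer needed. The sketch below uses posix_fadvise for this; it is an illustration of the technique, not the behavior of PostgreSQL or any specific engine, and the helper name is hypothetical.

```c
/* Sketch: ask the kernel to drop its page-cache copy of a file range after
 * the database has the pages in its own buffer pool.  Illustrative only. */
#include <fcntl.h>
#include <stdio.h>

int release_page_cache(int fd, off_t offset, off_t length) {
    /* POSIX_FADV_DONTNEED: this range will not be accessed again soon, so
     * the kernel may evict the corresponding page-cache pages.  Note that
     * dirty pages must be written back before they can actually be dropped. */
    int rc = posix_fadvise(fd, offset, length, POSIX_FADV_DONTNEED);
    if (rc != 0)
        fprintf(stderr, "posix_fadvise failed: %d\n", rc);
    return rc;
}
```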
Torn Writes
A database page is typically 8 KB or 16 KB, but the filesystem writes in 4 KB blocks (matching the hardware sector size). If power fails in the middle of writing a 16 KB page, the on-disk result is a torn page: some blocks contain new data and some contain old data.
Databases protect against torn writes in different ways. InnoDB uses a double-write buffer: before writing a page to its final location, it writes the page to a sequential double-write area. If a torn write occurs at the final location, InnoDB recovers the intact copy from the double-write buffer. PostgreSQL takes a different approach: it writes a full-page image (FPI) to the WAL after each checkpoint, so a torn page can be reconstructed from the WAL. ZFS and btrfs provide copy-on-write semantics at the filesystem level, which eliminates torn writes entirely.
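The detection half of every scheme is a per-page checksum written before the page goes to disk and verified after it is read back. A minimal sketch follows; the page layout (checksum in the first 4 bytes) and the FNV-1a hash are illustrative choices, whereas real engines typically use CRC32C.

```c
/* Sketch: detect a torn page via a checksum stored in the page header.
 * Layout and hash function are illustrative, not any engine's real format. */
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 8192

static uint32_t fnv1a(const uint8_t *data, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;
    }
    return h;
}

/* Called just before the page is written out. */
void page_set_checksum(uint8_t page[PAGE_SIZE]) {
    uint32_t sum = fnv1a(page + 4, PAGE_SIZE - 4);
    memcpy(page, &sum, sizeof(sum));
}

/* Called after reading the page back.  A false return means the page is
 * torn or corrupt and must be restored from a WAL full-page image or the
 * double-write buffer; the checksum can detect the tear but not repair it. */
bool page_checksum_ok(const uint8_t page[PAGE_SIZE]) {
    uint32_t stored;
    memcpy(&stored, page, sizeof(stored));
    return stored == fnv1a(page + 4, PAGE_SIZE - 4);
}
```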
SSD vs HDD Access Patterns
The shift from spinning disks to solid-state drives fundamentally changed which I/O patterns matter for database performance. HDDs have a mechanical seek time (moving the read/write head) that makes random I/O extremely expensive. SSDs have no moving parts, so random reads are nearly as fast as sequential reads. But random writes on SSDs trigger garbage collection internally, which can cause latency spikes.
The numbers tell the story. HDD random read throughput is roughly 0.5 MB/s (limited by roughly 9 ms of combined seek and rotational latency per read), while NVMe random reads reach 500 MB/s, a 1000x improvement. This is why B-tree indexes (which do random page reads) perform dramatically better on SSDs. But notice that even on NVMe, sequential I/O is still roughly 7x faster than random I/O. LSM-trees exploit this by converting random writes into sequential writes through compaction.
For HDDs, the I/O scheduler in the kernel is critical: it reorders requests to minimize head movement (elevator algorithm). For NVMe SSDs, the kernel scheduler is largely irrelevant because the device handles request scheduling internally with its multiple hardware queues.
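To see the gap on a specific device, a rough micro-benchmark like the sketch below compares sequential and random 8 KB reads. The file path and read count are placeholders; for meaningful numbers, the file must be much larger than RAM (or opened with O_DIRECT), otherwise the page cache serves most reads and hides the device.

```c
/* Rough micro-benchmark: sequential vs random 8 KB reads.
 * "testfile" and N_READS are placeholders. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

#define READ_SIZE 8192
#define N_READS   10000

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(void) {
    int fd = open("testfile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);
    long blocks = st.st_size / READ_SIZE;
    if (blocks == 0) { fprintf(stderr, "file too small\n"); return 1; }
    long n = N_READS < blocks ? N_READS : blocks;
    char *buf = malloc(READ_SIZE);

    /* Sequential: consecutive 8 KB blocks from the start of the file. */
    double t0 = now_sec();
    for (long i = 0; i < n; i++)
        pread(fd, buf, READ_SIZE, (off_t)i * READ_SIZE);
    double seq = now_sec() - t0;

    /* Random: the same number of reads at random block offsets. */
    srand(42);
    t0 = now_sec();
    for (long i = 0; i < n; i++)
        pread(fd, buf, READ_SIZE, (off_t)(rand() % blocks) * READ_SIZE);
    double rnd = now_sec() - t0;

    printf("sequential: %.1f MB/s\n", n * (double)READ_SIZE / seq / 1e6);
    printf("random:     %.1f MB/s\n", n * (double)READ_SIZE / rnd / 1e6);
    free(buf);
    close(fd);
    return 0;
}
```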
The fsync-gate Incident
In 2018, the PostgreSQL community discovered a critical flaw in how Linux handles fsync failures. The assumption that had underpinned PostgreSQL's durability guarantees for decades turned out to be wrong.
The core issue is a mismatch between what databases expect and what the Linux kernel does. PostgreSQL expected that after a failed fsync, the dirty pages would remain in the page cache so the database could retry. Instead, the kernel drops the pages, and the next fsync succeeds vacuously because there are no dirty pages left. The data is silently lost.
This is one of the strongest arguments for O_DIRECT: if the database bypasses the page cache entirely, it controls the write path and knows exactly what has and has not been written to disk. With buffered I/O, the database is at the mercy of kernel page cache semantics that can change between kernel versions.
NVMe vs SATA Command Queuing
The performance gap between NVMe and SATA SSDs is not just about raw bandwidth. The protocol-level difference in command queuing explains why NVMe scales so much better under concurrent database workloads.
SATA (AHCI) supports a single command queue with a maximum depth of 32. Every I/O request from every thread goes into the same queue. NVMe supports up to 65,535 queues, each with up to 65,536 entries. In practice, the Linux NVMe driver allocates one submission queue per CPU core, so each core can submit I/O without contending on a shared lock.
SATA plateaus around queue depth 32 because the single hardware queue is saturated. NVMe keeps scaling because outstanding requests are spread across many per-core hardware submission queues, so no single queue becomes the bottleneck. For a database running 64 concurrent queries, each doing random 4 KB page reads, NVMe delivers roughly 7x the IOPS of SATA. This is why modern database deployments strongly prefer NVMe storage.
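How a database keeps a deep queue outstanding is its own question. The sketch below uses liburing (io_uring) to submit 64 random 4 KB reads before reaping any completion, so the device sees a full queue rather than one request at a time. The file name, queue depth, and offset range are illustrative, and this is a sketch of the technique rather than any engine's actual I/O layer.

```c
/* Sketch: keep a deep queue outstanding with io_uring.
 * Requires liburing (link with -luring).  "datafile" and the queue depth
 * are illustrative. */
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define QUEUE_DEPTH 64
#define READ_SIZE   4096

int main(void) {
    int fd = open("datafile", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct io_uring ring;
    if (io_uring_queue_init(QUEUE_DEPTH, &ring, 0) < 0) {
        perror("io_uring_queue_init");
        return 1;
    }

    char *bufs[QUEUE_DEPTH];

    /* Submit QUEUE_DEPTH reads without waiting for any of them. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        bufs[i] = malloc(READ_SIZE);
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, bufs[i], READ_SIZE,
                           (off_t)(rand() % 1024) * READ_SIZE);
        io_uring_sqe_set_data(sqe, bufs[i]);
    }
    io_uring_submit(&ring);

    /* Reap completions; the device was free to service them in any order. */
    for (int i = 0; i < QUEUE_DEPTH; i++) {
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        if (cqe->res < 0)
            fprintf(stderr, "read failed: %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);
    }

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```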
Filesystem Comparison for Database Workloads
The choice of filesystem affects database durability, performance, and which I/O strategies are available. Not all filesystems handle O_DIRECT, journaling, and torn write protection the same way.
ext4 and XFS are the most common choices for production databases. Both support O_DIRECT fully and have decades of battle-testing. ZFS provides copy-on-write semantics that eliminate torn writes at the filesystem level, but its interaction with O_DIRECT is complex (ZFS effectively ignores O_DIRECT and manages its own caching via the ARC). btrfs offers similar CoW benefits but has less production track record for database workloads.
Complexity Analysis
The cost of database I/O depends on the access pattern, the storage device, and whether the data is in the OS page cache or the database buffer pool.
I/O Cost Model
For a single random page read of size $P$ bytes from a storage device with seek time $T_s$, rotational latency $T_r$, and transfer rate $R$:
$T_{\text{random}} = T_s + T_r + \frac{P}{R}$
For an HDD with $T_s = 5\text{ms}$, $T_r = 4\text{ms}$ (half rotation at 7200 RPM), and $R = 150\text{MB/s}$:
$T_{\text{random,HDD}} = 5 + 4 + \frac{8\text{KB}}{150\text{MB/s}} \approx 9.05\text{ms}$
For an NVMe SSD, seek and rotational latency are zero, and the access time is dominated by the device's internal latency $T_d \approx 0.02\text{ms}$:
$T_{\text{random,NVMe}} \approx T_d + \frac{P}{R} = 0.02 + \frac{8\text{KB}}{3500\text{MB/s}} \approx 0.022\text{ms}$
This gives NVMe a roughly 400x advantage for random page reads.
Effective Bandwidth vs Queue Depth
The effective bandwidth $B_{\text{eff}}$ of a storage device scales with queue depth $Q$ until the device saturates:
$B_{\text{eff}}(Q) = \min\left(Q \times \frac{P}{T_{\text{random}}}, B_{\text{max}}\right)$
For SATA, $Q_{\text{max}} = 32$, so the device saturates early. For NVMe, $Q$ can reach thousands, allowing the device to approach its maximum bandwidth $B_{\text{max}}$.
Write Amplification from Journaling
A single logical page write of size $P$ produces multiple physical writes:
$W_{\text{total}} = W_{\text{WAL}} + W_{\text{FS journal}} + W_{\text{data}} + W_{\text{double-write}}$
For InnoDB with the double-write buffer enabled (counting the redo record as a full page, the worst case):
$W_{\text{InnoDB}} = P_{\text{redo log}} + P_{\text{doublewrite}} + P_{\text{final}} = P + P + P = 3P$
For PostgreSQL with full-page writes after checkpoint:
$W_{\text{PG}} = P_{\text{WAL(FPI)}} + P_{\text{data}} = P + P = 2P \text{ (first write after checkpoint)}$
Subsequent writes to the same page before the next checkpoint only write the WAL delta (much smaller than $P$), so PostgreSQL's amortized write amplification is lower than InnoDB's for hot pages.
| Operation | HDD Latency | SATA SSD Latency | NVMe SSD Latency | Notes |
|---|---|---|---|---|
| Random 8 KB read | ~9 ms | ~0.1 ms | ~0.02 ms | B-tree index traversal |
| Sequential 8 KB read | ~0.05 ms | ~0.015 ms | ~0.002 ms | Range scan, WAL replay |
| Random 8 KB write | ~9 ms | ~0.05 ms | ~0.02 ms | Dirty page flush |
| fsync (single file) | ~9 ms | ~0.5 ms | ~0.02 ms | Commit latency floor |
| fsync (batch, 100 pages) | ~50 ms | ~2 ms | ~0.1 ms | Group commit |
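The last row of the table is group commit in code form: a batch of commit records shares a single fdatasync, so the cost of the device flush is amortized across every transaction in the batch. A minimal sketch, with illustrative names and sizes:

```c
/* Sketch of group commit: buffer several commit records, write them with
 * one pwrite, and make them all durable with a single fdatasync.
 * Record size, batch size, and names are illustrative. */
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define RECORD_SIZE 512
#define BATCH_SIZE  100

/* Returns 0 if all `count` records became durable, -1 otherwise. */
int group_commit(int wal_fd, off_t wal_end,
                 const char records[][RECORD_SIZE], int count) {
    char batch[BATCH_SIZE * RECORD_SIZE];
    if (count > BATCH_SIZE)
        return -1;

    for (int i = 0; i < count; i++)
        memcpy(batch + (size_t)i * RECORD_SIZE, records[i], RECORD_SIZE);

    size_t len = (size_t)count * RECORD_SIZE;
    if (pwrite(wal_fd, batch, len, wal_end) != (ssize_t)len)
        return -1;

    /* One flush covers every transaction in the batch: the fsync latency
     * floor is paid once instead of `count` times. */
    return fdatasync(wal_fd) == 0 ? 0 : -1;
}
```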
Implementation
Pseudocode: Durable Write with fsync
ALGORITHM DurablePageWrite(page, file_descriptor)
INPUT:  page: database page to write (8 KB aligned buffer)
        file_descriptor: open file handle
OUTPUT: SUCCESS or PANIC
BEGIN
    // Step 1: Write the page to the OS page cache (or directly if O_DIRECT)
    offset = page.page_id * PAGE_SIZE
    bytes_written = pwrite(file_descriptor, page.data, PAGE_SIZE, offset)
    IF bytes_written != PAGE_SIZE THEN
        LOG_ERROR("Short write: expected PAGE_SIZE, got", bytes_written)
        RETURN PANIC  // Cannot recover from partial write
    END IF

    // Step 2: Force the data to stable storage
    result = fsync(file_descriptor)
    IF result == ERROR THEN
        // fsync failed: the page cache state is unknown (fsync-gate)
        // The ONLY safe response is to crash and replay the WAL
        LOG_ERROR("fsync failed with EIO, initiating crash recovery")
        RETURN PANIC
    END IF

    RETURN SUCCESS
END
Pseudocode: Double-Write Buffer (InnoDB Style)
ALGORITHM DoubleWriteFlush(dirty_pages, doublewrite_fd, data_fd)
INPUT:  dirty_pages: array of pages to flush
        doublewrite_fd: file descriptor for the double-write buffer
        data_fd: file descriptor for the data file
OUTPUT: SUCCESS or PANIC
BEGIN
    // Phase 1: Write all pages sequentially to the double-write buffer
    dw_offset = 0
    FOR EACH page IN dirty_pages DO
        pwrite(doublewrite_fd, page.data, PAGE_SIZE, dw_offset)
        dw_offset = dw_offset + PAGE_SIZE
    END FOR

    // Phase 2: fsync the double-write buffer (sequential write, fast)
    IF fsync(doublewrite_fd) == ERROR THEN
        RETURN PANIC
    END IF

    // Phase 3: Write pages to their final locations (random writes)
    FOR EACH page IN dirty_pages DO
        offset = page.page_id * PAGE_SIZE
        pwrite(data_fd, page.data, PAGE_SIZE, offset)
    END FOR

    // Phase 4: fsync the data file
    IF fsync(data_fd) == ERROR THEN
        RETURN PANIC
    END IF

    RETURN SUCCESS
END

// Recovery: if a page in the data file has a bad checksum (torn write),
// read the intact copy from the double-write buffer and overwrite it.
Pseudocode: O_DIRECT Aligned Write
ALGORITHM DirectIOWrite(page, file_path)
INPUT:  page: database page with aligned buffer
        file_path: path to the data file
OUTPUT: SUCCESS or ERROR
BEGIN
    // Open with O_DIRECT to bypass the OS page cache
    fd = open(file_path, O_WRONLY | O_DIRECT | O_DSYNC)

    // O_DIRECT requires page-aligned buffers
    aligned_buffer = allocate_aligned(PAGE_SIZE, ALIGNMENT)  // 4 KB alignment
    memcpy(aligned_buffer, page.data, PAGE_SIZE)
    offset = page.page_id * PAGE_SIZE

    // Write size and file offset must be multiples of the filesystem block size
    bytes_written = pwrite(fd, aligned_buffer, PAGE_SIZE, offset)
    IF bytes_written != PAGE_SIZE THEN
        free(aligned_buffer)
        close(fd)
        RETURN ERROR
    END IF

    // O_DSYNC means the write is durable on return (no separate fsync needed)
    free(aligned_buffer)
    close(fd)
    RETURN SUCCESS
END
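For reference, here is roughly what the same operation looks like in C, where the alignment requirement becomes explicit through posix_memalign. The file name, PAGE_SIZE, and ALIGNMENT constants are placeholders; the sketch assumes Linux with a 4 KB filesystem block size.

```c
/* Sketch: an O_DIRECT page write in C.  posix_memalign satisfies the
 * alignment requirement; O_DSYNC makes the write durable on return.
 * "data.db" usage, PAGE_SIZE, and ALIGNMENT are illustrative. */
#define _GNU_SOURCE            /* for O_DIRECT on glibc */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define PAGE_SIZE 8192
#define ALIGNMENT 4096

int direct_page_write(const char *path, long page_id, const void *page) {
    int fd = open(path, O_WRONLY | O_DIRECT | O_DSYNC);
    if (fd < 0) { perror("open"); return -1; }

    /* Buffer address, length, and file offset must all satisfy alignment. */
    void *buf = NULL;
    if (posix_memalign(&buf, ALIGNMENT, PAGE_SIZE) != 0) {
        close(fd);
        return -1;
    }
    memcpy(buf, page, PAGE_SIZE);

    off_t offset = (off_t)page_id * PAGE_SIZE;
    ssize_t written = pwrite(fd, buf, PAGE_SIZE, offset);

    free(buf);
    close(fd);
    return written == PAGE_SIZE ? 0 : -1;
}
```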
Real-World Applications
PostgreSQL uses buffered I/O by default and relies on the OS page cache for read-ahead during sequential scans. After the fsync-gate discovery, PostgreSQL 12 changed its error handling to panic on fsync failure and replay the WAL, rather than retrying the fsync. PostgreSQL protects against torn writes by writing full-page images to the WAL after each checkpoint.
MySQL/InnoDB is typically run with innodb_flush_method=O_DIRECT (the default in recent versions) to avoid double caching of data files. It protects against torn writes with the double-write buffer, writing each page twice: once to a sequential buffer area, then to the final location. The redo log itself is flushed with fsync/fdatasync at commit; the O_DIRECT setting applies to the data files, not the log.
RocksDB (used by CockroachDB, TiKV, and many embedded systems) uses O_DIRECT for SST file writes and relies on the WAL for crash recovery. Its LSM-tree architecture naturally produces sequential writes, which aligns well with both SSD and HDD access patterns.
SQLite uses buffered I/O and fsync for durability in WAL mode. Its simplicity makes it the default embedded database, but it does not protect against torn writes at the page level. Instead, it relies on the WAL: frames in the WAL carry checksums, so a torn write at the tail of the log is detected during recovery and the incomplete transaction is discarded.
ZFS-based deployments (common with Oracle and some PostgreSQL setups) can disable the database's own torn-write protection because ZFS's copy-on-write semantics guarantee atomic page writes at the filesystem level. This eliminates the need for InnoDB's double-write buffer or PostgreSQL's full-page images, reducing write amplification.
Cloud storage (EBS, GCP Persistent Disk) adds another layer. These are network-attached block devices where fsync must traverse the network to the storage backend. Latency is higher (0.5-2 ms per fsync vs 0.02 ms for local NVMe), and the durability guarantee depends on the cloud provider's replication strategy, not the local disk firmware.
Key Takeaways
- The path from write() to stable storage crosses seven layers: user space, page cache, filesystem journal, I/O scheduler, device driver, controller, and physical media. Data can be lost at any layer without proper fsync usage.
- fsync forces data to stable storage, but the Linux kernel drops dirty pages from the page cache on fsync failure. The only safe response is to crash and replay the WAL (the lesson from fsync-gate).
- O_DIRECT bypasses the OS page cache, eliminating double caching but requiring page-aligned buffers and database-managed read-ahead. PostgreSQL uses buffered I/O; InnoDB and RocksDB use O_DIRECT.
- Torn writes occur when a multi-block page write is interrupted by power failure. InnoDB uses a double-write buffer (3x write amplification); PostgreSQL uses full-page WAL images (2x for the first write after checkpoint).
- NVMe SSDs scale linearly with queue depth thanks to thousands of hardware queues, while SATA SSDs plateau at queue depth 32. This makes NVMe dramatically better for concurrent database workloads.
- SSD random read latency (~0.02 ms on NVMe) is 400x faster than HDD (~9 ms), which is why B-tree indexes perform so much better on solid-state storage.
- The filesystem choice matters: ext4 and XFS are battle-tested for databases with full O_DIRECT support; ZFS eliminates torn writes via copy-on-write but has complex O_DIRECT interactions.