Memory Hierarchy: From Registers to Disk
How computers organize memory into layers of increasing size and decreasing speed, and why locality of reference makes this hierarchy effective.
Terminology
- Memory hierarchy: the organization of storage in a computer system into multiple levels, each trading off speed, size, and cost; faster levels are smaller and more expensive per byte
- Register: the fastest and smallest storage, located inside the CPU core; holds data currently being operated on by instructions
- SRAM (Static Random-Access Memory): a type of memory that uses flip-flop circuits to store each bit; fast and does not need refreshing, used for CPU caches
- DRAM (Dynamic Random-Access Memory): a type of memory that stores each bit as a charge in a capacitor; slower than SRAM and requires periodic refreshing, used for main memory (RAM)
- Cache line (cache block): the smallest unit of data that can be transferred between cache levels, typically 64 bytes on modern processors
- Temporal locality: the tendency of a program to access the same memory locations repeatedly within a short time period
- Spatial locality: the tendency of a program to access memory locations near recently accessed locations
- Hit rate: the fraction of memory accesses that are satisfied by a given cache level, expressed as a value between 0 and 1
- Miss rate: the fraction of memory accesses not found in a given cache level; equal to $1 - \text{hit rate}$
- Miss penalty: the additional time required to fetch data from a lower (slower) level of the hierarchy when a cache miss occurs
- Write-back: a cache write policy where modified data is written to the next level only when the cache line is evicted
- Write-through: a cache write policy where every write to the cache is simultaneously written to the next level
- LRU (Least Recently Used): a cache replacement policy that evicts the cache line that has not been accessed for the longest time
- Associativity: the number of cache lines in a set where a given memory address can be placed; higher associativity reduces conflict misses but increases lookup time
- Direct-mapped cache: a cache where each memory address maps to exactly one cache line (1-way associative)
- Set-associative cache: a cache where each memory address maps to a set of cache lines; $n$-way means each set has $n$ lines
- Fully associative cache: a cache where a memory address can be placed in any cache line
- TLB (Translation Lookaside Buffer): a specialized cache that stores recent virtual-to-physical address translations to speed up memory access
- Prefetching: a technique where the hardware or software predicts future memory accesses and loads data into the cache before it is requested
- Bandwidth: the rate at which data can be transferred between memory levels, measured in bytes per second
- Latency: the time delay from requesting data to receiving it
What & Why
The fundamental problem in computer architecture is that processors are fast and memory is slow. A modern CPU can execute an instruction every fraction of a nanosecond, but fetching a single byte from main memory takes roughly 100 nanoseconds. That is a 100x to 300x speed gap. If the CPU had to wait for main memory on every access, it would spend most of its time idle.
The memory hierarchy solves this by placing small, fast memories close to the CPU and large, slow memories farther away. Registers sit inside the CPU core and respond in a single cycle. L1 cache is tens of kilobytes and responds in about 4-5 cycles. L2 cache is hundreds of kilobytes to a megabyte at roughly 12-15 cycles. L3 cache is megabytes at roughly 40 cycles. Main memory (DRAM) is gigabytes at 100-300 cycles. Disk storage is terabytes at millions of cycles.
This works because of locality. Programs do not access memory randomly. They tend to reuse the same data (temporal locality) and access nearby addresses (spatial locality). A loop that iterates over an array accesses the same instructions repeatedly and reads array elements sequentially. The cache captures this working set, and the CPU rarely needs to wait for slow memory.
Understanding the memory hierarchy is essential for writing performant software. The difference between cache-friendly and cache-hostile code can be 10x or more in execution time, even for the same algorithmic complexity.
How It Works
The Hierarchy Pyramid
Picture the levels as a pyramid: registers at the narrow top, then L1, L2, and L3 caches, main memory, and disk at the wide base. Each step down gains capacity and loses speed, typically by an order of magnitude or more.
How Caches Work
When the CPU needs data at a memory address, it first checks the L1 cache. The address is split into three parts:
- Tag: identifies which block of memory is stored in this cache line
- Set index: determines which set in the cache to look in
- Block offset: identifies the specific byte within the cache line
For a cache with S sets, E lines per set (associativity), and B bytes per block:
The set index uses $\log_2(S)$ bits, the block offset uses $\log_2(B)$ bits, and the remaining address bits form the tag.
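This bit-slicing is easy to sketch in code. A minimal Python version, assuming (as real caches do) that the set count and block size are powers of two; the concrete parameters below are illustrative:

```python
def split_address(address: int, num_sets: int, block_size: int):
    """Split a memory address into (tag, set_index, block_offset).

    Assumes num_sets and block_size are powers of two.
    """
    offset_bits = block_size.bit_length() - 1   # log2(block_size)
    index_bits = num_sets.bit_length() - 1      # log2(num_sets)
    block_offset = address & (block_size - 1)
    set_index = (address >> offset_bits) & (num_sets - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, set_index, block_offset

# Example: 64-byte lines and 128 sets -> 6 offset bits, 7 index bits
tag, idx, off = split_address(0x12345, 128, 64)  # -> (9, 13, 5)
```

Reassembling `(tag * num_sets + set_index) * block_size + block_offset` recovers the original address, which is a handy sanity check.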
Direct-mapped (E = 1): Each address maps to exactly one cache line. Fast lookup but vulnerable to conflict misses when two frequently accessed addresses map to the same line.
Set-associative (E = 2, 4, 8): Each address maps to a set of E lines. The cache checks all lines in the set in parallel. Reduces conflict misses at the cost of slightly more complex hardware.
Fully associative (S = 1): Any address can go in any line. Eliminates conflict misses but requires comparing the tag against every line, which is expensive for large caches.
Cache Replacement Policies
When a cache set is full and a new block must be loaded, the cache must evict an existing line. Common policies:
LRU (Least Recently Used): Evict the line that has not been accessed for the longest time. Good approximation of optimal replacement but expensive to implement exactly for high associativity.
Pseudo-LRU: Approximates LRU using a binary tree of bits. Each access updates the tree to point away from the accessed line. Much cheaper to implement than true LRU.
Random: Evict a random line. Surprisingly competitive with LRU in practice and trivial to implement.
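LRU bookkeeping is straightforward to express in software. Below is a minimal sketch of one cache set using Python's `OrderedDict` to track recency (hardware would use per-line age bits or a pseudo-LRU tree instead; the class and method names here are illustrative):

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with LRU replacement; keys stand in for tags."""

    def __init__(self, associativity: int):
        self.associativity = associativity
        self.lines = OrderedDict()  # tag -> block data, oldest first

    def access(self, tag, block=None):
        """Return True on hit. On miss, install the block, evicting LRU."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark most recently used
            return True
        if len(self.lines) >= self.associativity:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[tag] = block
        return False

s = LRUSet(associativity=2)
s.access("A"); s.access("B")  # both miss; the set is now full
s.access("A")                 # hit; "B" becomes least recently used
s.access("C")                 # miss; evicts "B"
```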
Write Policies
Write-back: Modified data stays in the cache and is only written to the next level when evicted. Reduces memory traffic but requires a "dirty" bit per line to track modifications.
Write-through: Every write goes to both the cache and the next level immediately. Simpler but generates more memory traffic. Often combined with a write buffer to avoid stalling the CPU.
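The traffic difference between the two policies shows up even in a toy model: count the writes that reach the next level when the same cached word is updated repeatedly (a hypothetical single-line cache, purely for illustration):

```python
def write_traffic(num_writes: int, policy: str) -> int:
    """Writes reaching the next level when one cached word is updated
    num_writes times, under a toy single-line cache model."""
    next_level_writes = 0
    dirty = False
    for _ in range(num_writes):
        if policy == "write-through":
            next_level_writes += 1  # every write propagates immediately
        else:  # write-back: just set the dirty bit
            dirty = True
    if policy == "write-back" and dirty:
        next_level_writes += 1  # one write-back on eviction/flush
    return next_level_writes

# 1000 updates to the same word: write-through sends 1000 writes
# downstream, write-back sends exactly 1.
```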
Locality in Practice
Consider iterating over a 2D array stored in row-major order (C/C++ default). Accessing elements row by row exploits spatial locality because consecutive elements are adjacent in memory. Accessing elements column by column causes a cache miss on nearly every access because consecutive column elements are separated by an entire row's worth of bytes.
For an $N \times N$ array with elements of size $s$ bytes and cache lines of $B$ bytes, row-major traversal misses roughly once every $B / s$ accesses, for a miss rate of about $s / B$.
With 4-byte integers and 64-byte cache lines, row-major access misses once every 16 accesses (miss rate $\frac{4}{64} = \frac{1}{16}$), while column-major access can miss on every single access.
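A small simulation makes this concrete. The sketch below models only which 64-byte lines are resident (fully associative, LRU, 32-line capacity; these parameters are illustrative, not a real cache) and counts misses for both traversal orders of a 64×64 array of 4-byte integers:

```python
from collections import OrderedDict

def count_misses(addresses, capacity=32, line_size=64):
    """Count misses over an address trace, modeling a fully
    associative LRU cache holding `capacity` lines."""
    resident = OrderedDict()  # line number -> None, oldest first
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line in resident:
            resident.move_to_end(line)  # refresh recency
        else:
            misses += 1
            if len(resident) >= capacity:
                resident.popitem(last=False)  # evict LRU line
            resident[line] = None
    return misses

N, elem = 64, 4  # 64x64 ints in row-major layout
row_major = [(r * N + c) * elem for r in range(N) for c in range(N)]
col_major = [(r * N + c) * elem for c in range(N) for r in range(N)]

# Row order: one miss per 64-byte line -> 4096 / 16 = 256 misses.
# Column order: each column touches 64 distinct lines, more than the
# 32-line capacity, so LRU thrashes and all 4096 accesses miss.
```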
Complexity Analysis
| Memory Level | Typical Latency | Typical Size | Bandwidth |
|---|---|---|---|
| Registers | $\sim 0.3$ ns | $\sim 1$ KB | TB/s |
| L1 Cache | $\sim 1$ ns | $32$-$64$ KB | $\sim 1$ TB/s |
| L2 Cache | $\sim 3$-$10$ ns | $256$ KB-$1$ MB | $\sim 500$ GB/s |
| L3 Cache | $\sim 10$-$40$ ns | $4$-$64$ MB | $\sim 200$ GB/s |
| Main Memory | $\sim 100$ ns | $8$-$128$ GB | $\sim 50$ GB/s |
| SSD | $\sim 10$-$100$ $\mu$s | $256$ GB-$4$ TB | $\sim 3$-$7$ GB/s |
Average Memory Access Time (AMAT) for a single cache level:
$\text{AMAT} = t_{\text{hit}} + m \times t_{\text{miss}}$
Where $t_{\text{hit}}$ is the cache access time, $m$ is the miss rate, and $t_{\text{miss}}$ is the miss penalty (time to fetch from the next level).
For a multi-level hierarchy:
$\text{AMAT} = t_{L1} + m_{L1} \times (t_{L2} + m_{L2} \times (t_{L3} + m_{L3} \times t_{\text{mem}}))$
Example with typical values ($t_{L1} = 1$ ns, $t_{L2} = 5$ ns, $t_{L3} = 20$ ns, $t_{\text{mem}} = 100$ ns, $m_{L1} = 0.05$, $m_{L2} = 0.2$, $m_{L3} = 0.3$):
$\text{AMAT} = 1 + 0.05 \times (5 + 0.2 \times (20 + 0.3 \times 100)) = 1 + 0.05 \times 15 = 1.75$ ns
Without caches, every access would take $\sim 100$ ns. The hierarchy reduces effective access time by roughly $57\times$ in this example.
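The arithmetic can be checked directly with a short recursive function (the latencies and miss rates are the same illustrative values as above):

```python
def amat(levels, memory_time):
    """Average memory access time for a list of (hit_time, miss_rate)
    levels ordered fastest to slowest, backed by main memory."""
    if not levels:
        return memory_time
    hit_time, miss_rate = levels[0]
    return hit_time + miss_rate * amat(levels[1:], memory_time)

# Illustrative values: L1 = 1 ns, L2 = 5 ns, L3 = 20 ns, DRAM = 100 ns
t = amat([(1, 0.05), (5, 0.2), (20, 0.3)], 100)
print(t)        # 1.75 (ns)
print(100 / t)  # ~57x faster than always going to DRAM
```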
Implementation
ALGORITHM CacheLookup(address, cache)
INPUT: address: memory address, cache: cache structure with S sets, E lines/set, B bytes/block
OUTPUT: (hit/miss, data)
BEGIN
// Split the address into block offset, set index, and tag
offset <- address MOD B
setIndex <- (address / B) MOD S
tag <- address / (B * S)
set <- cache.sets[setIndex]
// Search all lines in the set
FOR i <- 0 TO E - 1 DO
line <- set.lines[i]
IF line.valid = true AND line.tag = tag THEN
// Cache hit
UPDATE_ACCESS_TIME(set, i)
RETURN (HIT, line.block[offset])
END IF
END FOR
// Cache miss: must fetch from next level
RETURN (MISS, FETCH_FROM_NEXT_LEVEL(address))
END
ALGORITHM HandleCacheMiss(address, cache, nextLevel)
INPUT: address: memory address, cache: cache structure, nextLevel: next memory level
OUTPUT: data at address
BEGIN
setIndex <- (address / cache.B) MOD cache.S
tag <- address / (cache.B * cache.S)
set <- cache.sets[setIndex]
// Fetch entire block from next level
blockStart <- address - (address MOD cache.B)
block <- READ_BLOCK(nextLevel, blockStart, cache.B)
// Find line to evict (LRU policy)
victim <- NULL
FOR i <- 0 TO cache.E - 1 DO
IF set.lines[i].valid = false THEN
victim <- set.lines[i]
BREAK
END IF
END FOR
IF victim = NULL THEN
victim <- FIND_LRU_LINE(set)
IF victim.dirty THEN
// Write back dirty line to next level
evictAddress <- (victim.tag * cache.S + setIndex) * cache.B
WRITE_BLOCK(nextLevel, evictAddress, victim.block)
END IF
END IF
// Install new block
victim.tag <- tag
victim.block <- block
victim.valid <- true
victim.dirty <- false
UPDATE_ACCESS_TIME(set, victim)
RETURN block[address MOD cache.B]
END
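A runnable counterpart to the two procedures above, as a simplified sketch: it tracks tags and dirty bits only (no data payloads), uses LRU within each set, and counts hits and misses. The class and parameter names are illustrative:

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Tag-only model of an E-way set-associative cache with S sets
    and B-byte lines; counts hits and misses."""

    def __init__(self, num_sets: int, associativity: int, block_size: int):
        self.S, self.E, self.B = num_sets, associativity, block_size
        self.sets = [OrderedDict() for _ in range(num_sets)]  # tag -> dirty
        self.hits = self.misses = 0

    def access(self, address: int, write: bool = False) -> bool:
        """Return True on hit; on miss, install the line, evicting LRU."""
        set_index = (address // self.B) % self.S
        tag = address // (self.B * self.S)
        lines = self.sets[set_index]
        if tag in lines:
            lines.move_to_end(tag)            # update recency
            lines[tag] = lines[tag] or write  # set dirty bit on write
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.E:
            lines.popitem(last=False)  # evict LRU (a real cache would
                                       # write the line back if dirty)
        lines[tag] = write
        return False

# Conflict misses in action: addresses 0 and 4096 map to the same set
# when S = 64 and B = 64, since (4096 / 64) MOD 64 == 0.
direct = SetAssociativeCache(num_sets=64, associativity=1, block_size=64)
twoway = SetAssociativeCache(num_sets=32, associativity=2, block_size=64)
for _ in range(4):
    direct.access(0); direct.access(4096)
    twoway.access(0); twoway.access(4096)
# Direct-mapped: the two lines keep evicting each other (8 misses).
# Two-way: after two cold misses, every access hits (6 hits).
```

The same total cache capacity behaves very differently once conflict misses enter the picture, which is exactly the trade-off associativity addresses.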
ALGORITHM ComputeAMAT(levels, memoryTime)
INPUT: levels: list of (hitTime, missRate) pairs from fastest to slowest, memoryTime: RAM access time
OUTPUT: average memory access time
BEGIN
IF LENGTH(levels) = 0 THEN
RETURN memoryTime
END IF
(hitTime, missRate) <- levels[0]
remainingLevels <- levels[1:]
RETURN hitTime + missRate * ComputeAMAT(remainingLevels, memoryTime)
END
ALGORITHM SimulateRowVsColumnAccess(matrix, N)
INPUT: matrix: N x N array in row-major order, N: dimension
OUTPUT: (rowMisses, colMisses)
BEGIN
cacheSimRow <- NEW_CACHE_SIMULATOR()
cacheSimCol <- NEW_CACHE_SIMULATOR()
// Row-major traversal
rowMisses <- 0
FOR row <- 0 TO N - 1 DO
FOR col <- 0 TO N - 1 DO
address <- BASE + (row * N + col) * ELEMENT_SIZE
result <- ACCESS(cacheSimRow, address)
IF result = MISS THEN
rowMisses <- rowMisses + 1
END IF
END FOR
END FOR
// Column-major traversal
colMisses <- 0
FOR col <- 0 TO N - 1 DO
FOR row <- 0 TO N - 1 DO
address <- BASE + (row * N + col) * ELEMENT_SIZE
result <- ACCESS(cacheSimCol, address)
IF result = MISS THEN
colMisses <- colMisses + 1
END IF
END FOR
END FOR
RETURN (rowMisses, colMisses)
END
ALGORITHM PrefetchSequentialAccess(baseAddress, count, stride, cache)
INPUT: baseAddress: starting address, count: number of elements, stride: bytes between elements, cache: cache
OUTPUT: none (side effect: data loaded into cache)
BEGIN
PREFETCH_DISTANCE <- 4 // prefetch 4 elements ahead
FOR i <- 0 TO count - 1 DO
currentAddr <- baseAddress + i * stride
// Issue prefetch for future access
IF i + PREFETCH_DISTANCE < count THEN
prefetchAddr <- baseAddress + (i + PREFETCH_DISTANCE) * stride
PREFETCH_INTO_CACHE(cache, prefetchAddr)
END IF
// Access current element (should be in cache from earlier prefetch)
data <- CACHE_ACCESS(cache, currentAddr)
PROCESS(data)
END FOR
END
Real-World Applications
- Matrix multiplication: naive implementations suffer from poor cache utilization; blocked (tiled) matrix multiplication processes sub-matrices that fit in cache, reducing misses from $O(N^3)$ to $O(N^3 / B)$ where $B$ is the block size
- Database buffer pools: databases like PostgreSQL and MySQL maintain their own buffer pool in user space, implementing replacement policies (clock sweep, LRU-K) tuned for database access patterns rather than relying solely on the OS page cache
- Web browsers: browsers use multi-level caching for web resources: memory cache for the current session, disk cache for persistent storage, and HTTP caching headers to coordinate with servers
- CPU design: Intel and AMD processors use inclusive, exclusive, or non-inclusive cache hierarchies depending on the microarchitecture; the choice affects how cache coherence protocols work in multi-core systems
- Data-oriented design in games: game engines like Unity's DOTS (Data-Oriented Technology Stack) restructure entity data from arrays-of-structs to structs-of-arrays to maximize cache line utilization during system updates
- Sorting algorithms: cache-oblivious algorithms like funnel sort achieve optimal cache performance without knowing the cache size, by recursively dividing the problem until sub-problems fit in cache
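The blocked matrix multiplication mentioned in the first bullet can be sketched as follows. This is pure Python for clarity only; the tile size of 32 is illustrative and would normally be tuned so that three tiles fit in cache together:

```python
def matmul_blocked(A, B, n, tile=32):
    """Multiply two n x n matrices (lists of lists) using tiling, so
    each tile of A, B, and C stays cache-resident while it is reused."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # One (tile x tile) sub-problem at a time
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        row_b, row_c = B[k], C[i]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += a * row_b[j]
    return C
```

In C or Fortran this loop structure dramatically reduces DRAM traffic; in pure Python interpreter overhead dominates, so the sketch only illustrates the loop ordering.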
Key Takeaways
- The memory hierarchy exploits locality of reference to bridge the speed gap between fast CPUs and slow main memory, with each level trading size for speed
- Caches work because programs exhibit temporal locality (reusing data) and spatial locality (accessing nearby data); cache lines of 64 bytes capture spatial locality automatically
- Average Memory Access Time (AMAT) depends on hit rates at each level: $\text{AMAT} = t_{\text{hit}} + m \times t_{\text{miss}}$; even small miss rate increases at L1 have large impacts because miss penalties are high
- Cache associativity trades lookup complexity for fewer conflict misses: direct-mapped is fastest but most prone to conflicts, while higher associativity reduces conflicts at the cost of more comparisons
- Data access patterns matter enormously: row-major traversal of a 2D array can be 10x or more faster than column-major traversal due to spatial locality and cache line utilization
- Write-back policies reduce memory traffic compared to write-through, which is why most modern L1 and L2 caches use write-back with dirty bits