Memory Hierarchy: From Registers to Disk
How computers organize memory into layers of increasing size and decreasing speed, and why locality of reference makes this hierarchy effective.
Terminology
- Memory hierarchy: the organization of storage in a computer system into multiple levels, each trading off speed, size, and cost; faster levels are smaller and more expensive per byte
- Register: the fastest and smallest storage, located inside the CPU core; holds data currently being operated on by instructions
- SRAM (Static Random-Access Memory): a type of memory that uses flip-flop circuits to store each bit; fast and does not need refreshing, used for CPU caches
- DRAM (Dynamic Random-Access Memory): a type of memory that stores each bit as a charge in a capacitor; slower than SRAM and requires periodic refreshing, used for main memory (RAM)
- Cache line (cache block): the smallest unit of data that can be transferred between cache levels, typically 64 bytes on modern processors
- Temporal locality: the tendency of a program to access the same memory locations repeatedly within a short time period
- Spatial locality: the tendency of a program to access memory locations near recently accessed locations
- Hit rate: the fraction of memory accesses that are satisfied by a given cache level, expressed as a value between 0 and 1
- Miss rate: the fraction of memory accesses not found in a given cache level; equal to $1 - \text{hit rate}$
- Miss penalty: the additional time required to fetch data from a lower (slower) level of the hierarchy when a cache miss occurs
- Write-back: a cache write policy where modified data is written to the next level only when the cache line is evicted
- Write-through: a cache write policy where every write to the cache is simultaneously written to the next level
- LRU (Least Recently Used): a cache replacement policy that evicts the cache line that has not been accessed for the longest time
- Associativity: the number of cache lines in a set where a given memory address can be placed; higher associativity reduces conflict misses but increases lookup time
- Direct-mapped cache: a cache where each memory address maps to exactly one cache line (1-way associative)
- Set-associative cache: a cache where each memory address maps to a set of cache lines; $n$-way means each set has $n$ lines
- Fully associative cache: a cache where a memory address can be placed in any cache line
- TLB (Translation Lookaside Buffer): a specialized cache that stores recent virtual-to-physical address translations to speed up memory access
- Prefetching: a technique where the hardware or software predicts future memory accesses and loads data into the cache before it is requested
- Bandwidth: the rate at which data can be transferred between memory levels, measured in bytes per second
- Latency: the time delay from requesting data to receiving it
What & Why
The fundamental problem in computer architecture is that processors are fast and memory is slow. A modern CPU can execute an instruction every fraction of a nanosecond, but fetching a single byte from main memory takes roughly 100 nanoseconds. That is a 100x to 300x speed gap. If the CPU had to wait for main memory on every access, it would spend most of its time idle.
The memory hierarchy solves this by placing small, fast memories close to the CPU and large, slow memories farther away. Registers sit inside the CPU core and respond in a single cycle. L1 cache is tens of kilobytes and responds in about 4-5 cycles. L2 cache is hundreds of kilobytes to a megabyte at roughly 12-15 cycles. L3 cache is megabytes at roughly 40 cycles. Main memory (DRAM) is gigabytes at 100-300 cycles. Disk storage is terabytes at millions of cycles.
This works because of locality. Programs do not access memory randomly. They tend to reuse the same data (temporal locality) and access nearby addresses (spatial locality). A loop that iterates over an array accesses the same instructions repeatedly and reads array elements sequentially. The cache captures this working set, and the CPU rarely needs to wait for slow memory.
Understanding the memory hierarchy is essential for writing performant software. The difference between cache-friendly and cache-hostile code can be 10x or more in execution time, even for the same algorithmic complexity.
How It Works
The Hierarchy Pyramid
Picture the levels as a pyramid: registers at the narrow top, then L1, L2, and L3 caches, main memory, and disk at the wide base. Each step down gains capacity and loses speed, typically by an order of magnitude or more.
How Caches Work
When the CPU needs data at a memory address, it first checks the L1 cache. The address is split into three parts:
- Tag: identifies which block of memory is stored in this cache line
- Set index: determines which set in the cache to look in
- Block offset: identifies the specific byte within the cache line
For a cache with S sets, E lines per set (associativity), and B bytes per block:
The set index uses $\log_2(S)$ bits, the block offset uses $\log_2(B)$ bits, and the remaining address bits form the tag.
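This bit-slicing is easy to sketch in code. A minimal Python version, assuming (as real caches do) that the set count and block size are powers of two; the concrete parameters below are illustrative:

```python
def split_address(address: int, num_sets: int, block_size: int):
    """Split a memory address into (tag, set_index, block_offset).

    Assumes num_sets and block_size are powers of two.
    """
    offset_bits = block_size.bit_length() - 1   # log2(block_size)
    index_bits = num_sets.bit_length() - 1      # log2(num_sets)
    block_offset = address & (block_size - 1)
    set_index = (address >> offset_bits) & (num_sets - 1)
    tag = address >> (offset_bits + index_bits)
    return tag, set_index, block_offset

# Example: 64-byte lines and 128 sets -> 6 offset bits, 7 index bits
tag, idx, off = split_address(0x12345, 128, 64)  # -> (9, 13, 5)
```

Reassembling `(tag * num_sets + set_index) * block_size + block_offset` recovers the original address, which is a handy sanity check.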
Direct-mapped (E = 1): Each address maps to exactly one cache line. Fast lookup but vulnerable to conflict misses when two frequently accessed addresses map to the same line.
Set-associative (E = 2, 4, 8): Each address maps to a set of E lines. The cache checks all lines in the set in parallel. Reduces conflict misses at the cost of slightly more complex hardware.
Fully associative (S = 1): Any address can go in any line. Eliminates conflict misses but requires comparing the tag against every line, which is expensive for large caches.
Cache Replacement Policies
When a cache set is full and a new block must be loaded, the cache must evict an existing line. Common policies:
LRU (Least Recently Used): Evict the line that has not been accessed for the longest time. Good approximation of optimal replacement but expensive to implement exactly for high associativity.
Pseudo-LRU: Approximates LRU using a binary tree of bits. Each access updates the tree to point away from the accessed line. Much cheaper to implement than true LRU.
Random: Evict a random line. Surprisingly competitive with LRU in practice and trivial to implement.
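LRU bookkeeping is straightforward to express in software. Below is a minimal sketch of one cache set using Python's `OrderedDict` to track recency (hardware would use per-line age bits or a pseudo-LRU tree instead; the class and method names here are illustrative):

```python
from collections import OrderedDict

class LRUSet:
    """One cache set with LRU replacement; keys stand in for tags."""

    def __init__(self, associativity: int):
        self.associativity = associativity
        self.lines = OrderedDict()  # tag -> block data, oldest first

    def access(self, tag, block=None):
        """Return True on hit. On miss, install the block, evicting LRU."""
        if tag in self.lines:
            self.lines.move_to_end(tag)  # mark most recently used
            return True
        if len(self.lines) >= self.associativity:
            self.lines.popitem(last=False)  # evict least recently used
        self.lines[tag] = block
        return False

s = LRUSet(associativity=2)
s.access("A"); s.access("B")  # both miss; the set is now full
s.access("A")                 # hit; "B" becomes least recently used
s.access("C")                 # miss; evicts "B"
```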
Write Policies
Write-back: Modified data stays in the cache and is only written to the next level when evicted. Reduces memory traffic but requires a "dirty" bit per line to track modifications.
Write-through: Every write goes to both the cache and the next level immediately. Simpler but generates more memory traffic. Often combined with a write buffer to avoid stalling the CPU.
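The traffic difference between the two policies shows up even in a toy model: count the writes that reach the next level when the same cached word is updated repeatedly (a hypothetical single-line cache, purely for illustration):

```python
def write_traffic(num_writes: int, policy: str) -> int:
    """Writes reaching the next level when one cached word is updated
    num_writes times, under a toy single-line cache model."""
    next_level_writes = 0
    dirty = False
    for _ in range(num_writes):
        if policy == "write-through":
            next_level_writes += 1  # every write propagates immediately
        else:  # write-back: just set the dirty bit
            dirty = True
    if policy == "write-back" and dirty:
        next_level_writes += 1  # one write-back on eviction/flush
    return next_level_writes

# 1000 updates to the same word: write-through sends 1000 writes
# downstream, write-back sends exactly 1.
```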
Locality in Practice
Consider iterating over a 2D array stored in row-major order (C/C++ default). Accessing elements row by row exploits spatial locality because consecutive elements are adjacent in memory. Accessing elements column by column causes a cache miss on nearly every access because consecutive column elements are separated by an entire row's worth of bytes.
For an $N \times N$ array with elements of size $s$ bytes and cache lines of $B$ bytes, row-major traversal misses roughly once every $B / s$ accesses, for a miss rate of about $s / B$.
With 4-byte integers and 64-byte cache lines, row-major access misses once every 16 accesses (miss rate $\frac{4}{64} = \frac{1}{16}$), while column-major access can miss on every single access.
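A small simulation makes this concrete. The sketch below models only which 64-byte lines are resident (fully associative, LRU, 32-line capacity; these parameters are illustrative, not a real cache) and counts misses for both traversal orders of a 64×64 array of 4-byte integers:

```python
from collections import OrderedDict

def count_misses(addresses, capacity=32, line_size=64):
    """Count misses over an address trace, modeling a fully
    associative LRU cache holding `capacity` lines."""
    resident = OrderedDict()  # line number -> None, oldest first
    misses = 0
    for addr in addresses:
        line = addr // line_size
        if line in resident:
            resident.move_to_end(line)  # refresh recency
        else:
            misses += 1
            if len(resident) >= capacity:
                resident.popitem(last=False)  # evict LRU line
            resident[line] = None
    return misses

N, elem = 64, 4  # 64x64 ints in row-major layout
row_major = [(r * N + c) * elem for r in range(N) for c in range(N)]
col_major = [(r * N + c) * elem for c in range(N) for r in range(N)]

# Row order: one miss per 64-byte line -> 4096 / 16 = 256 misses.
# Column order: each column touches 64 distinct lines, more than the
# 32-line capacity, so LRU thrashes and all 4096 accesses miss.
```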
Complexity Analysis
| Memory Level | Typical Latency | Typical Size | Bandwidth |
|---|---|---|---|
| Registers | $\sim 0.3$ ns | $\sim 1$ KB | TB/s |
| L1 Cache | $\sim 1$ ns | $32$-$64$ KB | $\sim 1$ TB/s |
| L2 Cache | $\sim 3$-$10$ ns | $256$ KB-$1$ MB | $\sim 500$ GB/s |
| L3 Cache | $\sim 10$-$40$ ns | $4$-$64$ MB | $\sim 200$ GB/s |
| Main Memory | $\sim 100$ ns | $8$-$128$ GB | $\sim 50$ GB/s |
| SSD | $\sim 10$-$100$ $\mu$s | $256$ GB-$4$ TB | $\sim 3$-$7$ GB/s |
Average Memory Access Time (AMAT) for a single cache level:
$\text{AMAT} = t_{\text{hit}} + m \times t_{\text{miss}}$
Where $t_{\text{hit}}$ is the cache access time, $m$ is the miss rate, and $t_{\text{miss}}$ is the miss penalty (time to fetch from the next level).
For a multi-level hierarchy:
$\text{AMAT} = t_{L1} + m_{L1} \times (t_{L2} + m_{L2} \times (t_{L3} + m_{L3} \times t_{\text{mem}}))$
Example with typical values ($t_{L1} = 1$ ns, $t_{L2} = 5$ ns, $t_{L3} = 20$ ns, $t_{\text{mem}} = 100$ ns, $m_{L1} = 0.05$, $m_{L2} = 0.2$, $m_{L3} = 0.3$):
$\text{AMAT} = 1 + 0.05 \times (5 + 0.2 \times (20 + 0.3 \times 100)) = 1 + 0.05 \times 15 = 1.75$ ns
Without caches, every access would take $\sim 100$ ns. The hierarchy reduces effective access time by roughly $57\times$ in this example.
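The arithmetic can be checked directly with a short recursive function (the latencies and miss rates are the same illustrative values as above):

```python
def amat(levels, memory_time):
    """Average memory access time for a list of (hit_time, miss_rate)
    levels ordered fastest to slowest, backed by main memory."""
    if not levels:
        return memory_time
    hit_time, miss_rate = levels[0]
    return hit_time + miss_rate * amat(levels[1:], memory_time)

# Illustrative values: L1 = 1 ns, L2 = 5 ns, L3 = 20 ns, DRAM = 100 ns
t = amat([(1, 0.05), (5, 0.2), (20, 0.3)], 100)
print(t)        # 1.75 (ns)
print(100 / t)  # ~57x faster than always going to DRAM
```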
Implementation
ALGORITHM CacheLookup(address, cache)
INPUT: address: memory address, cache: cache structure with S sets, E lines/set, B bytes/block
OUTPUT: (hit/miss, data)
BEGIN
// Split the address into block offset, set index, and tag
offset <- address MOD B
setIndex <- (address / B) MOD S
tag <- address / (B * S)
set <- cache.sets[setIndex]
// Search all lines in the set
FOR i <- 0 TO E - 1 DO
line <- set.lines[i]
IF line.valid = true AND line.tag = tag THEN
// Cache hit
UPDATE_ACCESS_TIME(set, i)
RETURN (HIT, line.block[offset])
END IF
END FOR
// Cache miss: must fetch from next level
RETURN (MISS, FETCH_FROM_NEXT_LEVEL(address))
END
ALGORITHM HandleCacheMiss(address, cache, nextLevel)
INPUT: address: memory address, cache: cache structure, nextLevel: next memory level
OUTPUT: data at address
BEGIN
setIndex <- (address / cache.B) MOD cache.S
tag <- address / (cache.B * cache.S)
set <- cache.sets[setIndex]
// Fetch entire block from next level
blockStart <- address - (address MOD cache.B)
block <- READ_BLOCK(nextLevel, blockStart, cache.B)
// Find line to evict (LRU policy)
victim <- NULL
FOR i <- 0 TO cache.E - 1 DO
IF set.lines[i].valid = false THEN
victim <- set.lines[i]
BREAK
END IF
END FOR
IF victim = NULL THEN
victim <- FIND_LRU_LINE(set)
IF victim.dirty THEN
// Write back dirty line to next level
evictAddress <- (victim.tag * cache.S + setIndex) * cache.B
WRITE_BLOCK(nextLevel, evictAddress, victim.block)
END IF
END IF
// Install new block
victim.tag <- tag
victim.block <- block
victim.valid <- true
victim.dirty <- false
UPDATE_ACCESS_TIME(set, victim)
RETURN block[address MOD cache.B]
END
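A runnable counterpart to the two procedures above, as a simplified sketch: it tracks tags and dirty bits only (no data payloads), uses LRU within each set, and counts hits and misses. The class and parameter names are illustrative:

```python
from collections import OrderedDict

class SetAssociativeCache:
    """Tag-only model of an E-way set-associative cache with S sets
    and B-byte lines; counts hits and misses."""

    def __init__(self, num_sets: int, associativity: int, block_size: int):
        self.S, self.E, self.B = num_sets, associativity, block_size
        self.sets = [OrderedDict() for _ in range(num_sets)]  # tag -> dirty
        self.hits = self.misses = 0

    def access(self, address: int, write: bool = False) -> bool:
        """Return True on hit; on miss, install the line, evicting LRU."""
        set_index = (address // self.B) % self.S
        tag = address // (self.B * self.S)
        lines = self.sets[set_index]
        if tag in lines:
            lines.move_to_end(tag)            # update recency
            lines[tag] = lines[tag] or write  # set dirty bit on write
            self.hits += 1
            return True
        self.misses += 1
        if len(lines) >= self.E:
            lines.popitem(last=False)  # evict LRU (a real cache would
                                       # write the line back if dirty)
        lines[tag] = write
        return False

# Conflict misses in action: addresses 0 and 4096 map to the same set
# when S = 64 and B = 64, since (4096 / 64) MOD 64 == 0.
direct = SetAssociativeCache(num_sets=64, associativity=1, block_size=64)
twoway = SetAssociativeCache(num_sets=32, associativity=2, block_size=64)
for _ in range(4):
    direct.access(0); direct.access(4096)
    twoway.access(0); twoway.access(4096)
# Direct-mapped: the two lines keep evicting each other (8 misses).
# Two-way: after two cold misses, every access hits (6 hits).
```

The same total cache capacity behaves very differently once conflict misses enter the picture, which is exactly the trade-off associativity addresses.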
ALGORITHM ComputeAMAT(levels, memoryTime)
INPUT: levels: list of (hitTime, missRate) pairs from fastest to slowest, memoryTime: RAM access time
OUTPUT: average memory access time
BEGIN
IF LENGTH(levels) = 0 THEN
RETURN memoryTime
END IF
(hitTime, missRate) <- levels[0]
remainingLevels <- levels[1:]
RETURN hitTime + missRate * ComputeAMAT(remainingLevels, memoryTime)
END
ALGORITHM SimulateRowVsColumnAccess(matrix, N)
INPUT: matrix: N x N array in row-major order, N: dimension
OUTPUT: (rowMisses, colMisses)
BEGIN
cacheSimRow <- NEW_CACHE_SIMULATOR()
cacheSimCol <- NEW_CACHE_SIMULATOR()
// Row-major traversal
rowMisses <- 0
FOR row <- 0 TO N - 1 DO
FOR col <- 0 TO N - 1 DO
address <- BASE + (row * N + col) * ELEMENT_SIZE
result <- ACCESS(cacheSimRow, address)
IF result = MISS THEN
rowMisses <- rowMisses + 1
END IF
END FOR
END FOR
// Column-major traversal
colMisses <- 0
FOR col <- 0 TO N - 1 DO
FOR row <- 0 TO N - 1 DO
address <- BASE + (row * N + col) * ELEMENT_SIZE
result <- ACCESS(cacheSimCol, address)
IF result = MISS THEN
colMisses <- colMisses + 1
END IF
END FOR
END FOR
RETURN (rowMisses, colMisses)
END
ALGORITHM PrefetchSequentialAccess(baseAddress, count, stride, cache)
INPUT: baseAddress: starting address, count: number of elements, stride: bytes between elements, cache: cache
OUTPUT: none (side effect: data loaded into cache)
BEGIN
PREFETCH_DISTANCE <- 4 // prefetch 4 elements ahead
FOR i <- 0 TO count - 1 DO
currentAddr <- baseAddress + i * stride
// Issue prefetch for future access
IF i + PREFETCH_DISTANCE < count THEN
prefetchAddr <- baseAddress + (i + PREFETCH_DISTANCE) * stride
PREFETCH_INTO_CACHE(cache, prefetchAddr)
END IF
// Access current element (should be in cache from earlier prefetch)
data <- CACHE_ACCESS(cache, currentAddr)
PROCESS(data)
END FOR
END
Real-World Applications
- Matrix multiplication: naive implementations suffer from poor cache utilization; blocked (tiled) matrix multiplication processes sub-matrices that fit in cache, reducing misses from $O(N^3)$ to $O(N^3 / B)$ where $B$ is the block size
- Database buffer pools: databases like PostgreSQL and MySQL maintain their own buffer pool in user space, implementing replacement policies (clock sweep, LRU-K) tuned for database access patterns rather than relying solely on the OS page cache
- Web browsers: browsers use multi-level caching for web resources: memory cache for the current session, disk cache for persistent storage, and HTTP caching headers to coordinate with servers
- CPU design: Intel and AMD processors use inclusive, exclusive, or non-inclusive cache hierarchies depending on the microarchitecture; the choice affects how cache coherence protocols work in multi-core systems
- Data-oriented design in games: game engines like Unity's DOTS (Data-Oriented Technology Stack) restructure entity data from arrays-of-structs to structs-of-arrays to maximize cache line utilization during system updates
- Sorting algorithms: cache-oblivious algorithms like funnel sort achieve optimal cache performance without knowing the cache size, by recursively dividing the problem until sub-problems fit in cache
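The blocked matrix multiplication mentioned in the first bullet can be sketched as follows. This is pure Python for clarity only; the tile size of 32 is illustrative and would normally be tuned so that three tiles fit in cache together:

```python
def matmul_blocked(A, B, n, tile=32):
    """Multiply two n x n matrices (lists of lists) using tiling, so
    each tile of A, B, and C stays cache-resident while it is reused."""
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for kk in range(0, n, tile):
            for jj in range(0, n, tile):
                # One (tile x tile) sub-problem at a time
                for i in range(ii, min(ii + tile, n)):
                    for k in range(kk, min(kk + tile, n)):
                        a = A[i][k]
                        row_b, row_c = B[k], C[i]
                        for j in range(jj, min(jj + tile, n)):
                            row_c[j] += a * row_b[j]
    return C
```

In C or Fortran this loop structure dramatically reduces DRAM traffic; in pure Python interpreter overhead dominates, so the sketch only illustrates the loop ordering.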
Key Takeaways
- The memory hierarchy exploits locality of reference to bridge the speed gap between fast CPUs and slow main memory, with each level trading size for speed
- Caches work because programs exhibit temporal locality (reusing data) and spatial locality (accessing nearby data); cache lines of 64 bytes capture spatial locality automatically
- Average Memory Access Time (AMAT) depends on hit rates at each level: $\text{AMAT} = t_{\text{hit}} + m \times t_{\text{miss}}$; even small miss rate increases at L1 have large impacts because miss penalties are high
- Cache associativity trades lookup complexity for fewer conflict misses: direct-mapped is fastest but most prone to conflicts, while higher associativity reduces conflicts at the cost of more comparisons
- Data access patterns matter enormously: row-major traversal of a 2D array can be 10x or more faster than column-major traversal due to spatial locality and cache line utilization
- Write-back policies reduce memory traffic compared to write-through, which is why most modern L1 and L2 caches use write-back with dirty bits