Hash Tables: Constant-Time Lookup
How hash functions, collision resolution, and dynamic resizing combine to give hash tables their O(1) average-case performance.
Terminology
- Hash Function
- A function that maps an input key (of arbitrary size) to a fixed-size integer, called the hash code. A good hash function distributes keys uniformly across the output range.
- Bucket
- A slot in the hash table's internal array. Each bucket stores one or more key-value pairs that map to the same index.
- Collision
- When two distinct keys produce the same bucket index after hashing. Every hash table must have a strategy for handling collisions.
- Load Factor ($\alpha$)
- The ratio of stored entries to total buckets: $\alpha = n / m$. A higher load factor means more collisions and slower lookups.
- Chaining
- A collision resolution strategy where each bucket holds a linked list (or other collection) of all entries that hash to that index.
- Open Addressing
- A collision resolution strategy where all entries live directly in the bucket array. On collision, the table probes subsequent slots using a deterministic sequence.
- Rehashing
- The process of allocating a larger bucket array and re-inserting all existing entries when the load factor exceeds a threshold.
- Amortized Analysis
- A technique for averaging the cost of expensive operations (like rehashing) over a sequence of cheap operations, showing that the average cost per operation remains low.
What & Why
A hash table is a data structure that maps keys to values with average-case O(1) lookup, insertion, and deletion. It achieves this by using a hash function to convert each key into an array index, jumping directly to where the value is stored instead of searching through elements sequentially.
Why does this matter? Nearly every program needs fast key-value lookups. Databases use hash indexes for equality queries. Programming languages implement objects, dictionaries, and sets with hash tables under the hood. Caches, routers, and compilers all rely on hashing for performance-critical paths.
Key properties:
- Average O(1) operations: lookup, insert, and delete are constant time when the hash function distributes keys well
- Key-value storage: associates arbitrary keys with values, unlike arrays which use integer indices
- Unordered: elements have no inherent ordering (unlike BSTs or sorted arrays)
- Space-time trade-off: uses extra memory (empty buckets) to achieve speed
How It Works
The Core Idea
A hash table maintains an internal array of $m$ buckets. To store a key-value pair $(k, v)$:
1. Compute the hash code: $h = \text{hash}(k)$
2. Map to a bucket index: $i = h \bmod m$
3. Store $(k, v)$ at bucket $i$, handling collisions if the bucket is occupied
To retrieve a value by key, repeat steps 1-2 and then search within that bucket.
Suppose, for example, that the keys "bob" and "dave" both hash to bucket 3, creating a collision. Under chaining, both entries are stored in that bucket's list.
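Steps 1-2 can be sketched in Python. The base-31 string hash and the table size `m = 8` here are illustrative choices only; the actual bucket indices depend entirely on the hash function a table uses.

```python
def polynomial_hash(key: str) -> int:
    """A base-31 polynomial hash over the key's characters (illustrative)."""
    h = 0
    for ch in key:
        h = h * 31 + ord(ch)
    return h

def bucket_index(key: str, m: int) -> int:
    """Step 1: compute the hash code; step 2: map it to one of m buckets."""
    return polynomial_hash(key) % m

m = 8  # a small table for illustration
for key in ("bob", "dave"):
    print(key, "->", bucket_index(key, m))
```

With this particular toy hash the two keys land in different buckets; whether any two keys collide is a property of the hash function and table size, not of the keys themselves.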
Collision Resolution: Chaining
Each bucket points to a linked list of entries. On collision, the new entry is appended to the list. Lookup scans the list at the target bucket.
Average chain length equals the load factor $\alpha$. When $\alpha$ stays below a threshold (commonly 0.75), chains remain short and lookups stay fast.
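The claim that the average chain length equals $\alpha$ is just arithmetic: $n$ entries spread over $m$ buckets average $n/m$ per bucket, however unevenly they fall. A quick sketch (the bucket count and keys are arbitrary):

```python
from collections import defaultdict

def chain_lengths(keys, m):
    """Distribute keys into m chained buckets; return every chain's length."""
    buckets = defaultdict(list)
    for k in keys:
        buckets[hash(k) % m].append(k)
    return [len(buckets[i]) for i in range(m)]

keys = [f"key{i}" for i in range(12)]
m = 16
lengths = chain_lengths(keys, m)
alpha = len(keys) / m
# The mean chain length is exactly n/m = alpha, no matter how keys cluster.
print(f"load factor = {alpha}, mean chain length = {sum(lengths) / m}")
```

Note that Python randomizes its built-in string hash per process, so individual chain lengths vary between runs; the mean does not.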
Collision Resolution: Open Addressing
All entries live in the bucket array itself. When a collision occurs at index $i$, the table probes a sequence of alternative indices:
- Linear probing: try $i+1, i+2, i+3, \ldots$ (simple but causes clustering)
- Quadratic probing: try $i+1^2, i+2^2, i+3^2, \ldots$ (reduces primary clustering)
- Double hashing: try $i + j \cdot h_2(k)$ for $j = 1, 2, 3, \ldots$ using a second hash function (best distribution)
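The simplest of the three, linear probing, can be sketched as follows (insert and lookup only; deletion would additionally need tombstone markers, as the complexity table notes):

```python
def probe_insert(table, key, value):
    """Linear probing: on a collision at i, try i+1, i+2, ... wrapping around."""
    m = len(table)
    i = hash(key) % m
    for step in range(m):
        j = (i + step) % m
        if table[j] is None or table[j][0] == key:
            table[j] = (key, value)
            return
    raise RuntimeError("table is full")

def probe_get(table, key):
    """Follow the same probe sequence until the key or an empty slot appears."""
    m = len(table)
    i = hash(key) % m
    for step in range(m):
        j = (i + step) % m
        if table[j] is None:
            return None          # an empty slot ends the probe sequence: absent
        if table[j][0] == key:
            return table[j][1]
    return None

table = [None] * 8
probe_insert(table, "bob", 1)
probe_insert(table, "dave", 2)
print(probe_get(table, "bob"))   # 1
```

The empty-slot check in `probe_get` is why deletion cannot simply clear a slot: doing so would cut the probe sequence short for keys stored past it.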
Rehashing
When the load factor exceeds a threshold, the table allocates a new array (typically $2m$ buckets) and re-inserts every entry. A single rehash costs $O(n)$, but amortized over all insertions it adds only $O(1)$ per insert: the same doubling argument used for dynamic arrays.
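The doubling argument can be checked empirically by counting every entry moved during rehashes over a run of inserts: the total stays below $2n$, i.e. $O(1)$ amortized per insert. The starting capacity, the 0.75 threshold, and the $2m$ growth below mirror the text, purely as a sketch.

```python
def count_rehash_work(n, threshold=0.75):
    """Simulate n inserts; count every entry re-inserted by a rehash."""
    capacity, size, moved = 8, 0, 0
    for _ in range(n):
        if size / capacity >= threshold:
            moved += size      # a rehash re-inserts every existing entry
            capacity *= 2      # new array of 2m buckets
        size += 1
    return moved

n = 10_000
moved = count_rehash_work(n)
print(f"{moved} total moves for {n} inserts -> {moved / n:.2f} amortized")
```

Each rehash moves at most as many entries as all previous rehashes combined, so the geometric sum is bounded by twice the final size.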
Complexity Analysis
| Operation | Average | Worst Case | Notes |
|---|---|---|---|
| Lookup | $O(1)$ | $O(n)$ | Worst case: all keys collide |
| Insert | $O(1)$ amortized | $O(n)$ | Rehashing costs $O(n)$ |
| Delete | $O(1)$ | $O(n)$ | Open addressing needs tombstones |
| Space | $O(n)$ | $O(n)$ | Plus empty bucket overhead |
Expected Lookup Cost (Chaining)
With a uniform hash function and load factor $\alpha = n/m$, the expected number of comparisons for a successful search is approximately:

$1 + \dfrac{\alpha}{2}$

For an unsuccessful search (key not found), it is approximately:

$1 + \alpha$

Keeping $\alpha \leq 0.75$ means at most ~1.375 comparisons on average for a hit.
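Evaluating these expectations at a few load factors makes the trade-off concrete (a pure arithmetic check, using the standard chaining approximations of $1 + \alpha/2$ for hits and $1 + \alpha$ for misses):

```python
def expected_hit(alpha):
    """Expected comparisons for a successful chained lookup: 1 + alpha/2."""
    return 1 + alpha / 2

def expected_miss(alpha):
    """Expected comparisons for an unsuccessful chained lookup: 1 + alpha."""
    return 1 + alpha

for alpha in (0.5, 0.75, 1.0):
    print(f"alpha={alpha}: hit ~ {expected_hit(alpha)}, miss ~ {expected_miss(alpha)}")
```

Even at $\alpha = 1$ the averages stay small; it is the worst case, not the average, that degrades when all keys collide.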
Implementation
Hash Table with Chaining (Pseudocode)
STRUCTURE HashTable:
    buckets ← array of CAPACITY empty lists
    count ← 0
    loadThreshold ← 0.75

FUNCTION hash(table, key):
    h ← 0
    FOR EACH char IN toString(key):
        h ← (h * 31 + charCode(char)) MOD MAX_INT
    RETURN abs(h) MOD length(table.buckets)

FUNCTION set(table, key, value):
    IF table.count / length(table.buckets) >= table.loadThreshold THEN
        rehash(table)
    idx ← hash(table, key)
    chain ← table.buckets[idx]
    FOR EACH pair IN chain:
        IF pair.key = key THEN
            pair.value ← value
            RETURN
    APPEND (key, value) TO chain
    table.count ← table.count + 1

FUNCTION get(table, key):
    idx ← hash(table, key)
    FOR EACH pair IN table.buckets[idx]:
        IF pair.key = key THEN RETURN pair.value
    RETURN NOT_FOUND

FUNCTION delete(table, key):
    idx ← hash(table, key)
    chain ← table.buckets[idx]
    FOR i ← 0 TO length(chain) - 1:
        IF chain[i].key = key THEN
            REMOVE chain[i]
            table.count ← table.count - 1
            RETURN true
    RETURN false

FUNCTION rehash(table):
    oldBuckets ← table.buckets
    table.buckets ← array of (length(oldBuckets) * 2) empty lists
    table.count ← 0
    FOR EACH chain IN oldBuckets:
        FOR EACH (k, v) IN chain:
            set(table, k, v)
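The pseudocode above translates almost line for line into Python. This sketch substitutes Python's built-in `hash` for the polynomial hash and uses a sentinel object for misses so that `None` remains a storable value; both are implementation choices, not requirements.

```python
NOT_FOUND = object()  # sentinel so None can be stored as a value

class HashTable:
    def __init__(self, capacity=8, load_threshold=0.75):
        self.buckets = [[] for _ in range(capacity)]
        self.count = 0
        self.load_threshold = load_threshold

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def set(self, key, value):
        if self.count / len(self.buckets) >= self.load_threshold:
            self._rehash()
        chain = self.buckets[self._index(key)]
        for pair in chain:
            if pair[0] == key:
                pair[1] = value          # existing key: update in place
                return
        chain.append([key, value])
        self.count += 1

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return NOT_FOUND

    def delete(self, key):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                self.count -= 1
                return True
        return False

    def _rehash(self):
        old = self.buckets
        self.buckets = [[] for _ in range(len(old) * 2)]
        self.count = 0           # set() below re-counts every entry
        for chain in old:
            for k, v in chain:
                self.set(k, v)

t = HashTable()
t.set("alice", 1)
t.set("bob", 2)
print(t.get("alice"))   # 1
```

Entries are stored as two-element lists rather than tuples so `set` can update a value in place without rebuilding the pair.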
Real-World Applications
- Language runtimes: JavaScript objects, Python dicts, Java HashMaps, and Go maps are all hash table implementations
- Database indexing: hash indexes enable $O(1)$ equality lookups (e.g., PostgreSQL hash indexes)
- Caching: in-memory caches like Memcached and Redis use hash tables as their core data structure
- Compilers: symbol tables map variable names to their types and memory locations using hash tables
- Networking: connection tracking tables in firewalls and NAT devices hash on (src IP, dst IP, port) tuples
- Deduplication: detecting duplicate records, files, or requests by hashing content and checking for matches
Key Takeaways
- Hash tables provide average $O(1)$ lookup, insert, and delete by converting keys to array indices via a hash function
- Collisions are inevitable; chaining and open addressing are the two main resolution strategies
- The load factor $\alpha$ controls the trade-off between memory usage and performance; keeping $\alpha \leq 0.75$ is a common threshold
- Rehashing doubles the table size and re-inserts all entries, giving amortized $O(1)$ insertion cost
- Worst-case $O(n)$ occurs when all keys collide, which is why choosing a good hash function matters