Hash Tables: Constant-Time Lookup
How hash functions, collision resolution, and dynamic resizing combine to give hash tables their O(1) average-case performance.
Terminology
- Hash Function
- A function that maps an input key (of arbitrary size) to a fixed-size integer, called the hash code. A good hash function distributes keys uniformly across the output range.
- Bucket
- A slot in the hash table's internal array. Each bucket stores one or more key-value pairs that map to the same index.
- Collision
- When two distinct keys produce the same bucket index after hashing. Every hash table must have a strategy for handling collisions.
- Load Factor ($\alpha$)
- The ratio of stored entries to total buckets: $\alpha = n / m$. A higher load factor means more collisions and slower lookups.
- Chaining
- A collision resolution strategy where each bucket holds a linked list (or other collection) of all entries that hash to that index.
- Open Addressing
- A collision resolution strategy where all entries live directly in the bucket array. On collision, the table probes subsequent slots using a deterministic sequence.
- Rehashing
- The process of allocating a larger bucket array and re-inserting all existing entries when the load factor exceeds a threshold.
- Amortized Analysis
- A technique for averaging the cost of expensive operations (like rehashing) over a sequence of cheap operations, showing that the average cost per operation remains low.
What & Why
A hash table is a data structure that maps keys to values with average-case O(1) lookup, insertion, and deletion. It achieves this by using a hash function to convert each key into an array index, jumping directly to where the value is stored instead of searching through elements sequentially.
Why does this matter? Nearly every program needs fast key-value lookups. Databases use hash indexes for equality queries. Programming languages implement objects, dictionaries, and sets with hash tables under the hood. Caches, routers, and compilers all rely on hashing for performance-critical paths.
Key properties:
- Average O(1) operations: lookup, insert, and delete are constant time when the hash function distributes keys well
- Key-value storage: associates arbitrary keys with values, unlike arrays which use integer indices
- Unordered: elements have no inherent ordering (unlike BSTs or sorted arrays)
- Space-time trade-off: uses extra memory (empty buckets) to achieve speed
How It Works
The Core Idea
A hash table maintains an internal array of $m$ buckets. To store a key-value pair $(k, v)$:
1. Compute the hash code: $h = \text{hash}(k)$
2. Map to a bucket index: $i = h \bmod m$
3. Store $(k, v)$ at bucket $i$, handling collisions if the bucket is occupied
To retrieve a value by key, repeat steps 1-2 and then search within that bucket.
Suppose, for example, that the keys "bob" and "dave" both hash to bucket 3, creating a collision. Under chaining, both entries are stored in that bucket's list.
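Steps 1-2 can be sketched in Python. The base-31 string hash and the table size `m = 8` here are illustrative choices only; the actual bucket indices depend entirely on the hash function a table uses.

```python
def polynomial_hash(key: str) -> int:
    """A base-31 polynomial hash over the key's characters (illustrative)."""
    h = 0
    for ch in key:
        h = h * 31 + ord(ch)
    return h

def bucket_index(key: str, m: int) -> int:
    """Step 1: compute the hash code; step 2: map it to one of m buckets."""
    return polynomial_hash(key) % m

m = 8  # a small table for illustration
for key in ("bob", "dave"):
    print(key, "->", bucket_index(key, m))
```

With this particular toy hash the two keys land in different buckets; whether any two keys collide is a property of the hash function and table size, not of the keys themselves.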
Collision Resolution: Chaining
Each bucket points to a linked list of entries. On collision, the new entry is appended to the list. Lookup scans the list at the target bucket.
Average chain length equals the load factor $\alpha$. When $\alpha$ stays below a threshold (commonly 0.75), chains remain short and lookups stay fast.
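The claim that the average chain length equals $\alpha$ is just arithmetic: $n$ entries spread over $m$ buckets average $n/m$ per bucket, however unevenly they fall. A quick sketch (the bucket count and keys are arbitrary):

```python
from collections import defaultdict

def chain_lengths(keys, m):
    """Distribute keys into m chained buckets; return every chain's length."""
    buckets = defaultdict(list)
    for k in keys:
        buckets[hash(k) % m].append(k)
    return [len(buckets[i]) for i in range(m)]

keys = [f"key{i}" for i in range(12)]
m = 16
lengths = chain_lengths(keys, m)
alpha = len(keys) / m
# The mean chain length is exactly n/m = alpha, no matter how keys cluster.
print(f"load factor = {alpha}, mean chain length = {sum(lengths) / m}")
```

Note that Python randomizes its built-in string hash per process, so individual chain lengths vary between runs; the mean does not.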
Collision Resolution: Open Addressing
All entries live in the bucket array itself. When a collision occurs at index $i$, the table probes a sequence of alternative indices:
- Linear probing: try $i+1, i+2, i+3, \ldots$ (simple but causes clustering)
- Quadratic probing: try $i+1^2, i+2^2, i+3^2, \ldots$ (reduces primary clustering)
- Double hashing: try $i + j \cdot h_2(k)$ for $j = 1, 2, 3, \ldots$ using a second hash function (best distribution)
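The simplest of the three, linear probing, can be sketched as follows (insert and lookup only; deletion would additionally need tombstone markers, as the complexity table notes):

```python
def probe_insert(table, key, value):
    """Linear probing: on a collision at i, try i+1, i+2, ... wrapping around."""
    m = len(table)
    i = hash(key) % m
    for step in range(m):
        j = (i + step) % m
        if table[j] is None or table[j][0] == key:
            table[j] = (key, value)
            return
    raise RuntimeError("table is full")

def probe_get(table, key):
    """Follow the same probe sequence until the key or an empty slot appears."""
    m = len(table)
    i = hash(key) % m
    for step in range(m):
        j = (i + step) % m
        if table[j] is None:
            return None          # an empty slot ends the probe sequence: absent
        if table[j][0] == key:
            return table[j][1]
    return None

table = [None] * 8
probe_insert(table, "bob", 1)
probe_insert(table, "dave", 2)
print(probe_get(table, "bob"))   # 1
```

The empty-slot check in `probe_get` is why deletion cannot simply clear a slot: doing so would cut the probe sequence short for keys stored past it.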
Rehashing
When the load factor exceeds a threshold, the table allocates a new array (typically $2m$ buckets) and re-inserts every entry. A single rehash costs $O(n)$, but amortized over all insertions it adds only $O(1)$ per insert: the same doubling argument used for dynamic arrays.
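The doubling argument can be checked empirically by counting every entry moved during rehashes over a run of inserts: the total stays below $2n$, i.e. $O(1)$ amortized per insert. The starting capacity, the 0.75 threshold, and the $2m$ growth below mirror the text, purely as a sketch.

```python
def count_rehash_work(n, threshold=0.75):
    """Simulate n inserts; count every entry re-inserted by a rehash."""
    capacity, size, moved = 8, 0, 0
    for _ in range(n):
        if size / capacity >= threshold:
            moved += size      # a rehash re-inserts every existing entry
            capacity *= 2      # new array of 2m buckets
        size += 1
    return moved

n = 10_000
moved = count_rehash_work(n)
print(f"{moved} total moves for {n} inserts -> {moved / n:.2f} amortized")
```

Each rehash moves at most as many entries as all previous rehashes combined, so the geometric sum is bounded by twice the final size.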
Complexity Analysis
| Operation | Average | Worst Case | Notes |
|---|---|---|---|
| Lookup | $O(1)$ | $O(n)$ | Worst case: all keys collide |
| Insert | $O(1)$ amortized | $O(n)$ | Rehashing costs $O(n)$ |
| Delete | $O(1)$ | $O(n)$ | Open addressing needs tombstones |
| Space | $O(n)$ | $O(n)$ | Plus empty bucket overhead |
Expected Lookup Cost (Chaining)
With a uniform hash function and load factor $\alpha = n/m$, the expected number of comparisons for a successful search is approximately:

$1 + \dfrac{\alpha}{2}$

For an unsuccessful search (key not found), it is approximately:

$1 + \alpha$

Keeping $\alpha \leq 0.75$ means at most ~1.375 comparisons on average for a hit.
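Evaluating these expectations at a few load factors makes the trade-off concrete (a pure arithmetic check, using the standard chaining approximations of $1 + \alpha/2$ for hits and $1 + \alpha$ for misses):

```python
def expected_hit(alpha):
    """Expected comparisons for a successful chained lookup: 1 + alpha/2."""
    return 1 + alpha / 2

def expected_miss(alpha):
    """Expected comparisons for an unsuccessful chained lookup: 1 + alpha."""
    return 1 + alpha

for alpha in (0.5, 0.75, 1.0):
    print(f"alpha={alpha}: hit ~ {expected_hit(alpha)}, miss ~ {expected_miss(alpha)}")
```

Even at $\alpha = 1$ the averages stay small; it is the worst case, not the average, that degrades when all keys collide.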
Implementation
Hash Table with Chaining (Pseudocode)
STRUCTURE HashTable:
    buckets ← array of CAPACITY empty lists
    count ← 0
    loadThreshold ← 0.75

FUNCTION hash(table, key):
    h ← 0
    FOR EACH char IN toString(key):
        h ← (h * 31 + charCode(char)) MOD MAX_INT
    RETURN abs(h) MOD length(table.buckets)

FUNCTION set(table, key, value):
    IF table.count / length(table.buckets) >= table.loadThreshold THEN
        rehash(table)
    idx ← hash(table, key)
    chain ← table.buckets[idx]
    FOR EACH pair IN chain:
        IF pair.key = key THEN
            pair.value ← value
            RETURN
    APPEND (key, value) TO chain
    table.count ← table.count + 1

FUNCTION get(table, key):
    idx ← hash(table, key)
    FOR EACH pair IN table.buckets[idx]:
        IF pair.key = key THEN RETURN pair.value
    RETURN NOT_FOUND

FUNCTION delete(table, key):
    idx ← hash(table, key)
    chain ← table.buckets[idx]
    FOR i ← 0 TO length(chain) - 1:
        IF chain[i].key = key THEN
            REMOVE chain[i]
            table.count ← table.count - 1
            RETURN true
    RETURN false

FUNCTION rehash(table):
    oldBuckets ← table.buckets
    table.buckets ← array of (length(oldBuckets) * 2) empty lists
    table.count ← 0
    FOR EACH chain IN oldBuckets:
        FOR EACH (k, v) IN chain:
            set(table, k, v)
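The pseudocode above translates almost line for line into Python. This sketch substitutes Python's built-in `hash` for the polynomial hash and uses a sentinel object for misses so that `None` remains a storable value; both are implementation choices, not requirements.

```python
NOT_FOUND = object()  # sentinel so None can be stored as a value

class HashTable:
    def __init__(self, capacity=8, load_threshold=0.75):
        self.buckets = [[] for _ in range(capacity)]
        self.count = 0
        self.load_threshold = load_threshold

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def set(self, key, value):
        if self.count / len(self.buckets) >= self.load_threshold:
            self._rehash()
        chain = self.buckets[self._index(key)]
        for pair in chain:
            if pair[0] == key:
                pair[1] = value          # existing key: update in place
                return
        chain.append([key, value])
        self.count += 1

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        return NOT_FOUND

    def delete(self, key):
        chain = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(chain):
            if k == key:
                del chain[i]
                self.count -= 1
                return True
        return False

    def _rehash(self):
        old = self.buckets
        self.buckets = [[] for _ in range(len(old) * 2)]
        self.count = 0           # set() below re-counts every entry
        for chain in old:
            for k, v in chain:
                self.set(k, v)

t = HashTable()
t.set("alice", 1)
t.set("bob", 2)
print(t.get("alice"))   # 1
```

Entries are stored as two-element lists rather than tuples so `set` can update a value in place without rebuilding the pair.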
Real-World Applications
- Language runtimes: JavaScript objects, Python dicts, Java HashMaps, and Go maps are all hash table implementations
- Database indexing: hash indexes enable $O(1)$ equality lookups (e.g., PostgreSQL hash indexes)
- Caching: in-memory caches like Memcached and Redis use hash tables as their core data structure
- Compilers: symbol tables map variable names to their types and memory locations using hash tables
- Networking: connection tracking tables in firewalls and NAT devices hash on (src IP, dst IP, port) tuples
- Deduplication: detecting duplicate records, files, or requests by hashing content and checking for matches
Key Takeaways
- Hash tables provide average $O(1)$ lookup, insert, and delete by converting keys to array indices via a hash function
- Collisions are inevitable; chaining and open addressing are the two main resolution strategies
- The load factor $\alpha$ controls the trade-off between memory usage and performance; keeping $\alpha \leq 0.75$ is a common threshold
- Rehashing doubles the table size and re-inserts all entries, giving amortized $O(1)$ insertion cost
- Worst-case $O(n)$ occurs when all keys collide, which is why choosing a good hash function matters