
Bloom Filters: Probabilistic Membership Testing

How Bloom filters use multiple hash functions and a bit array to answer 'is this element in the set?' with extreme space efficiency, at the cost of occasional false positives.

2021-02-05
Tags: Advanced Data Structures · bloom-filters · data-structures

Terminology

  • Bloom filter: a space-efficient probabilistic data structure that tests whether an element is a member of a set; it can produce false positives but never false negatives
  • Bit array: a fixed-size array where each position holds a single bit (0 or 1)
  • Hash function: a function that maps an input to a fixed-size integer; a Bloom filter uses $k$ independent hash functions
  • False positive: reporting that an element is in the set when it actually is not; caused by hash collisions filling the same bit positions
  • False negative: reporting that an element is not in the set when it actually is; Bloom filters guarantee this never happens
  • False positive rate (FPR): the probability that a query for a non-member returns "possibly in set"
  • Membership query: asking "is element $x$ in the set?"
  • Saturation: the state where most bits in the array are set to 1, causing a high false positive rate
  • Counting Bloom filter: a variant that uses counters instead of bits, enabling element deletion

What & Why

A Bloom filter answers one question: "Is this element in the set?" The answer is either "definitely not" or "probably yes." It never produces false negatives, meaning if it says an element is absent, you can trust that completely. But it can produce false positives, saying an element might be present when it is not.

Why accept false positives? Because the space savings are enormous. A Bloom filter can represent a set of millions of elements in just a few kilobytes. A hash set storing the same elements would need megabytes. When you can tolerate a small error rate (say 1%), a Bloom filter uses roughly 10 bits per element regardless of element size.

Typical use pattern: use the Bloom filter as a fast, cheap first check. If it says "not in set," you are done. If it says "maybe in set," do the expensive lookup (disk read, network call, database query) to confirm. This avoids the expensive operation for the vast majority of negative lookups.
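The pattern above can be sketched in a few lines of Python. The `might_contain` callable and the `database` dictionary here are illustrative stand-ins, not part of any particular library:

```python
# Two-tier lookup: the filter screens out definite misses, so the
# expensive backing store is only consulted on "maybe" answers.
def lookup(key, might_contain, database):
    if not might_contain(key):
        return None              # "definitely not": skip the expensive query
    return database.get(key)     # "maybe": confirm with the real lookup

# Stand-in filter: a plain set never gives false negatives, so it mimics
# the guarantee (a real Bloom filter would also say "maybe" on some misses).
cached = {"alpha", "beta"}
store = {"alpha": 1, "beta": 2}
print(lookup("alpha", cached.__contains__, store))  # 1
print(lookup("gamma", cached.__contains__, store))  # None
```

In production, `database.get` would be the disk read, network call, or database query the filter is shielding.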

How It Works

Structure

A Bloom filter consists of a bit array of $m$ bits (all initialized to 0) and $k$ independent hash functions, each mapping an element to one of the $m$ bit positions.

Insert

To add element $x$: compute $h_1(x), h_2(x), \ldots, h_k(x)$ and set each of those bit positions to 1.

Query

To check if $x$ is in the set: compute the same $k$ hash positions. If all $k$ bits are 1, return "probably yes." If any bit is 0, return "definitely not."

Bit array (m = 16):

  index: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
  bit:   0 1 0 1 0 0 1 0 0 1  0  1  0  0  0  0

Insert "cat": h1=1, h2=3, h3=6  -> sets bits 1, 3, 6
Insert "dog": h1=3, h2=9, h3=11 -> sets bits 3, 9, 11
Query "fox":  h1=1, h2=6, h3=9  -> all bits are 1 -> "maybe" (false positive)
Query "bat":  h1=0, h2=4, h3=7  -> bit 0 is 0 -> "definitely not"
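The trace above can be reproduced in Python. The hash positions are hard-coded to match the example (they are illustrative values, not outputs of a real hash function):

```python
m = 16
bits = [0] * m

# Illustrative hash positions taken from the worked example above.
positions = {
    "cat": [1, 3, 6],
    "dog": [3, 9, 11],
    "fox": [1, 6, 9],
    "bat": [0, 4, 7],
}

def insert(x):
    for p in positions[x]:
        bits[p] = 1

def query(x):
    # "Probably yes" only if every probed bit is 1.
    return all(bits[p] == 1 for p in positions[x])

insert("cat")
insert("dog")
print(query("fox"))  # True  -> "maybe" (a false positive)
print(query("bat"))  # False -> "definitely not"
```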

Why No False Negatives?

When you insert an element, its $k$ bit positions are set to 1 and never cleared. So when you query for that same element, those same $k$ positions are guaranteed to still be 1. The filter can only say "definitely not" if at least one bit is 0, which cannot happen for an inserted element.

Why False Positives?

Different elements can hash to overlapping bit positions. If element A sets bits {1, 3, 6} and element B sets bits {3, 9, 11}, then a query for element C with hash positions {1, 6, 9} finds all bits set to 1 even though C was never inserted.

Complexity Analysis

For a Bloom filter with $m$ bits, $k$ hash functions, and $n$ inserted elements:

$\text{FPR} \approx \left(1 - e^{-kn/m}\right)^k$
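This approximation follows from the probability that a given bit is still 0 after $n$ insertions: each of the $kn$ hash evaluations misses that bit with probability $1 - 1/m$, so

$P(\text{bit} = 0) = \left(1 - \frac{1}{m}\right)^{kn} \approx e^{-kn/m}$

A false positive requires all $k$ probed bits to be 1, each with probability roughly $1 - e^{-kn/m}$; treating the bits as independent (a standard approximation) gives the formula above.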

The optimal number of hash functions that minimizes the false positive rate is:

$k_{\text{opt}} = \frac{m}{n} \ln 2$

With optimal $k$, the false positive rate simplifies to:

$\text{FPR} \approx \left(\frac{1}{2}\right)^k = (0.6185)^{m/n}$
| Operation | Time | Notes |
| --- | --- | --- |
| Insert | $O(k)$ | Compute $k$ hashes, set $k$ bits |
| Query | $O(k)$ | Compute $k$ hashes, check $k$ bits |
| Delete | Not supported | Clearing a bit could cause false negatives; use counting variant |
| Space | $O(m)$ bits | For 1% FPR: ~9.6 bits per element |

Practical sizing: for a 1% false positive rate, you need about 9.6 bits per element and 7 hash functions. For 0.1%, about 14.4 bits per element and 10 hash functions. This is dramatically less than storing the actual elements.
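These sizing figures can be checked numerically from the formulas above; the function name here is chosen to mirror the pseudocode:

```python
import math

def optimal_parameters(n, target_fpr):
    """Bit array size m and hash count k minimizing FPR for n elements."""
    m = math.ceil(-n * math.log(target_fpr) / (math.log(2) ** 2))
    k = round(m / n * math.log(2))
    return m, k

for fpr in (0.01, 0.001):
    m, k = optimal_parameters(1_000_000, fpr)
    print(f"FPR {fpr}: {m / 1_000_000:.1f} bits/element, k = {k}")
# FPR 0.01: 9.6 bits/element, k = 7
# FPR 0.001: 14.4 bits/element, k = 10
```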

Implementation

STRUCTURE BloomFilter
  bits: bit array of size m, initialized to all 0s
  m: size of the bit array
  k: number of hash functions
END STRUCTURE

ALGORITHM BloomFilterInsert(filter, element)
INPUT: filter: BloomFilter, element: item to add
BEGIN
  FOR i <- 1 TO filter.k DO
    position <- Hash_i(element) MOD filter.m
    filter.bits[position] <- 1
  END FOR
END

ALGORITHM BloomFilterQuery(filter, element)
INPUT: filter: BloomFilter, element: item to check
OUTPUT: "possibly in set" or "definitely not in set"
BEGIN
  FOR i <- 1 TO filter.k DO
    position <- Hash_i(element) MOD filter.m
    IF filter.bits[position] = 0 THEN
      RETURN "definitely not in set"
    END IF
  END FOR
  RETURN "possibly in set"
END

ALGORITHM OptimalParameters(n, targetFPR)
INPUT: n: expected number of elements, targetFPR: desired false positive rate
OUTPUT: (m, k) optimal bit array size and hash count
BEGIN
  m <- CEILING(-n * ln(targetFPR) / (ln(2))^2)
  k <- ROUND(m / n * ln(2))
  RETURN (m, k)
END
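The pseudocode above translates into a compact runnable sketch. This version (mine, not from any particular library) derives its $k$ hash positions from a single SHA-256 digest via double hashing, i.e. $h_i(x) = h_1(x) + i \cdot h_2(x) \bmod m$, the Kirsch-Mitzenmacher technique, rather than using $k$ genuinely independent hash functions:

```python
import hashlib
import math

class BloomFilter:
    def __init__(self, n, target_fpr):
        # Size the filter using the optimal-parameter formulas.
        self.m = math.ceil(-n * math.log(target_fpr) / (math.log(2) ** 2))
        self.k = max(1, round(self.m / n * math.log(2)))
        self.bits = bytearray((self.m + 7) // 8)  # packed bit array

    def _positions(self, element):
        # Derive two 64-bit base hashes from one SHA-256 digest, then
        # combine them: position_i = (h1 + i*h2) mod m (double hashing).
        digest = hashlib.sha256(element.encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force odd step
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def insert(self, element):
        for p in self._positions(element):
            self.bits[p // 8] |= 1 << (p % 8)

    def query(self, element):
        # "Possibly in set" only if every probed bit is 1.
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(element))

bf = BloomFilter(n=1000, target_fpr=0.01)
bf.insert("cat")
print(bf.query("cat"))  # True (inserted elements are never missed)
```

Deletion is deliberately absent: clearing any of an element's bits could break the no-false-negatives guarantee for other elements sharing those bits.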

Real-World Applications

  • Web caching: CDNs use Bloom filters to check if a URL has been cached before making an expensive origin fetch; Google Chrome used a Bloom filter to check URLs against a malware blacklist
  • Database query optimization: PostgreSQL, Cassandra, and HBase use Bloom filters to skip disk reads for rows that definitely do not exist in an SSTable or data block
  • Network routers: packet deduplication and loop detection use Bloom filters to track recently seen packet signatures without storing full packet data
  • Spell checkers: early spell checkers used Bloom filters to store dictionaries in limited memory
  • Distributed systems: nodes exchange Bloom filters to summarize their data, enabling efficient set reconciliation without transferring full datasets
  • Bitcoin: SPV (lightweight) clients use Bloom filters to request only relevant transactions from full nodes without revealing their exact addresses

Key Takeaways

  • Bloom filters answer "is $x$ in the set?" with zero false negatives and a tunable false positive rate
  • Space efficiency is the main advantage: ~10 bits per element for a 1% error rate, regardless of element size
  • Operations are $O(k)$ where $k$ is the number of hash functions, typically a small constant (7-10)
  • Standard Bloom filters do not support deletion; counting Bloom filters (using counters instead of bits) add this capability at the cost of more space
  • The optimal number of hash functions is $k = (m/n) \ln 2$, balancing between too few (high collision rate) and too many (too many bits set)