
Tries: Prefix Trees for Autocomplete and String Matching

How tries store strings character by character in a tree structure, enabling prefix search, autocomplete, and fast string matching.

2021-02-03
Advanced Data Structures · tries · data-structures

Terminology

  • Trie: a tree-shaped data structure where each node represents a single character, and paths from root to marked nodes spell out stored strings (the name comes from "retrieval")
  • Prefix: the beginning portion of a string; "app" is a prefix of "apple"
  • Root: the topmost node in the trie, representing the empty string
  • Edge: a link between a parent node and a child node, labeled with a character
  • Terminal node: a node marked to indicate that the path from the root to this node forms a complete stored string (also called an "end-of-word" marker)
  • Alphabet: the set of possible characters at each level, e.g., lowercase English letters (size 26) or ASCII (size 128)
  • Branching factor: the maximum number of children per node, equal to the alphabet size
  • Compressed trie (radix tree): a trie variant that merges chains of single-child nodes into one node storing a substring instead of a single character
  • Longest common prefix: the longest string that is a prefix of two or more given strings

What & Why

A trie is a tree where each path from the root to a terminal node represents a stored string. Instead of comparing entire keys, as a BST or hash table does, a trie processes one character at a time, branching at each level on the next character of the key.

Why use a trie instead of a hash table? Hash tables give O(1) average lookup for exact matches, but they cannot answer prefix queries. A trie can reach the subtree containing every string that starts with "auto" in O(k) time, where k is the prefix length, regardless of how many strings are stored; enumerating the matches then costs time proportional to their number. This makes tries the natural choice for autocomplete, spell checkers, IP routing tables, and any problem involving prefix matching.

Key properties:

  • Lookup, insert, and delete all run in O(k) time where k is the key length, independent of the number of stored keys
  • Prefix search is a first-class operation: walk the prefix path, then collect all descendants
  • No hash collisions, no need for a hash function
  • Space can be high if the alphabet is large and the trie is sparse, but compressed variants (radix trees) mitigate this

How It Works

Structure

Each node has up to $|\Sigma|$ children (one per character in the alphabet). A boolean flag marks whether the node is the end of a stored string.

"" a a t t p p n n o o p p t t p p = terminal (end of word) Stored words: "app", "ant", "top"

Search

To search for a string, start at the root and follow the edge labeled with each successive character. If at any point the required edge does not exist, the string is not in the trie. If you reach the end of the string, check the terminal flag: if it is set, the string exists; otherwise, the string is only a prefix of some stored string.

Prefix Search

Walk the trie along the prefix characters. If you reach a valid node, perform a depth-first traversal from that node to collect all terminal descendants. Each descendant represents a stored string that starts with the given prefix.

Insertion

Walk the trie character by character. If an edge for the current character does not exist, create a new node and link it. After processing the last character, mark the node as terminal.

Deletion

Find the terminal node for the string and unmark it. Then, walk back up and remove any nodes that are no longer part of another stored string (nodes with no children and no terminal flag).
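The unmark-then-prune logic is easiest to express recursively: each call reports whether its node became dead so the parent can drop it. A minimal Python sketch, assuming dict-based child maps (the `TrieNode` and `insert` helpers here are illustrative, not part of the pseudocode later in this post):

```python
class TrieNode:
    def __init__(self):
        self.children = {}       # character -> TrieNode
        self.is_terminal = False

def insert(root, word):
    node = root
    for c in word:
        node = node.children.setdefault(c, TrieNode())
    node.is_terminal = True

def delete(node, word, depth=0):
    """Unmark word; return True if this node is now dead and can be pruned."""
    if depth == len(word):
        node.is_terminal = False              # unmark the end-of-word flag
    else:
        c = word[depth]
        child = node.children.get(c)
        if child is not None and delete(child, word, depth + 1):
            del node.children[c]              # prune the dead branch
    # Dead = not the end of any stored word and no remaining children
    return not node.is_terminal and len(node.children) == 0
```

Deleting "app" from a trie holding "app", "ant", and "top" prunes the trailing "p"-"p" chain but keeps the shared "a" node, which "ant" still needs.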

Complexity Analysis

Let $k$ be the length of the key and $|\Sigma|$ be the alphabet size.

Operation                    Time                           Notes
Search                       $O(k)$                         Independent of the number of stored strings
Insert                       $O(k)$                         Creates at most $k$ new nodes
Delete                       $O(k)$                         Walk down, then clean up
Prefix search ($p$ results)  $O(k + p \cdot L)$             $O(k)$ to reach the prefix node, then DFS over $p$ results of average length $L$
Space (worst case)           $O(n \cdot k \cdot |\Sigma|)$  Each node holds $|\Sigma|$ child pointers; shared prefixes reduce actual usage

The space cost is the main drawback. With a 26-letter alphabet, each node carries 26 pointers. For sparse tries, using a hash map or sorted array per node instead of a fixed array reduces space at the cost of slightly slower access.

Compressed tries (radix trees) collapse chains of single-child nodes into one node storing a substring, reducing both node count and space.
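The collapsing step can be sketched as a bottom-up pass over an ordinary character trie: every non-terminal node with exactly one child is merged into its incoming edge label. This Python sketch (the dict-based `RadixNode` and two-phase build are illustrative assumptions, not the only way to construct a radix tree):

```python
class RadixNode:
    def __init__(self):
        self.children = {}        # edge label (string) -> RadixNode
        self.is_terminal = False

def insert(root, word):
    # Phase 1: build an ordinary trie, one character per edge
    node = root
    for c in word:
        node = node.children.setdefault(c, RadixNode())
    node.is_terminal = True

def compress(node):
    # Phase 2: merge each single-child, non-terminal node into its incoming edge
    new_children = {}
    for label, child in node.children.items():
        while len(child.children) == 1 and not child.is_terminal:
            (next_label, next_child), = child.children.items()
            label += next_label   # absorb the chain into the edge label
            child = next_child
        compress(child)
        new_children[label] = child
    node.children = new_children
```

Compressing a trie of "app", "ant", and "top" leaves the root with edges "a" and "top", and the "a" node with edges "pp" and "nt": five edges instead of eight nodes.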

Implementation

STRUCTURE TrieNode
  children: array of TrieNode, size |alphabet|, initialized to NULL
  isTerminal: boolean, initialized to FALSE
END STRUCTURE

ALGORITHM TrieInsert(root, word)
INPUT: root: TrieNode, word: string to insert
BEGIN
  node <- root
  FOR EACH character c IN word DO
    index <- CharToIndex(c)
    IF node.children[index] IS NULL THEN
      node.children[index] <- new TrieNode()
    END IF
    node <- node.children[index]
  END FOR
  node.isTerminal <- TRUE
END

ALGORITHM TrieSearch(root, word)
INPUT: root: TrieNode, word: string to search
OUTPUT: TRUE if word exists, FALSE otherwise
BEGIN
  node <- root
  FOR EACH character c IN word DO
    index <- CharToIndex(c)
    IF node.children[index] IS NULL THEN
      RETURN FALSE
    END IF
    node <- node.children[index]
  END FOR
  RETURN node.isTerminal
END

ALGORITHM TriePrefixSearch(root, prefix)
INPUT: root: TrieNode, prefix: string
OUTPUT: list of all stored strings starting with prefix
BEGIN
  node <- root
  FOR EACH character c IN prefix DO
    index <- CharToIndex(c)
    IF node.children[index] IS NULL THEN
      RETURN empty list
    END IF
    node <- node.children[index]
  END FOR

  results <- empty list
  CollectAll(node, prefix, results)
  RETURN results
END

ALGORITHM CollectAll(node, currentWord, results)
BEGIN
  IF node.isTerminal THEN
    Append(results, currentWord)
  END IF
  FOR EACH index IN 0 .. |alphabet| - 1 DO
    IF node.children[index] IS NOT NULL THEN
      CollectAll(node.children[index], currentWord + IndexToChar(index), results)
    END IF
  END FOR
END
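The pseudocode above translates almost line for line into Python. This sketch assumes a lowercase a–z alphabet for CharToIndex and keeps the fixed-size child array from the STRUCTURE definition:

```python
ALPHABET_SIZE = 26

class TrieNode:
    def __init__(self):
        self.children = [None] * ALPHABET_SIZE  # one slot per letter
        self.is_terminal = False

def char_to_index(c):
    return ord(c) - ord("a")

def insert(root, word):
    node = root
    for c in word:
        i = char_to_index(c)
        if node.children[i] is None:
            node.children[i] = TrieNode()
        node = node.children[i]
    node.is_terminal = True

def search(root, word):
    node = root
    for c in word:
        i = char_to_index(c)
        if node.children[i] is None:
            return False
        node = node.children[i]
    return node.is_terminal

def prefix_search(root, prefix):
    node = root
    for c in prefix:
        i = char_to_index(c)
        if node.children[i] is None:
            return []
        node = node.children[i]
    results = []
    collect_all(node, prefix, results)
    return results

def collect_all(node, current, results):
    if node.is_terminal:
        results.append(current)
    for i, child in enumerate(node.children):
        if child is not None:
            collect_all(child, current + chr(ord("a") + i), results)
```

Because `collect_all` scans child indices in order, prefix results come back alphabetically sorted for free, which is convenient for autocomplete dropdowns.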

Real-World Applications

  • Autocomplete: search engines and text editors use tries to suggest completions as you type; the prefix search operation returns all matching entries instantly
  • Spell checkers: dictionaries stored as tries enable fast lookup and "did you mean?" suggestions by exploring nearby branches
  • IP routing: routers use compressed tries (Patricia tries) to match IP address prefixes against routing table entries in $O(k)$ time
  • T9 predictive text: older phone keyboards mapped number sequences to words using a trie of the dictionary
  • Genome analysis: suffix tries (in practice, their compressed form, suffix trees) index DNA sequences for fast substring matching across billions of base pairs
  • File systems: some file systems use trie-like structures to index directory paths for fast path resolution

Key Takeaways

  • Tries process strings character by character, giving $O(k)$ operations independent of the dataset size
  • Prefix search is the killer feature: hash tables cannot answer prefix queries at all, and sorted structures need an extra $O(\log n)$ search plus full key comparisons to do the same
  • Space is the main trade-off; each node may carry $|\Sigma|$ pointers, most of which are null in sparse tries
  • Compressed tries (radix trees) collapse single-child chains to save space while preserving $O(k)$ performance
  • Tries are the backbone of autocomplete, spell checking, IP routing, and any system that needs fast prefix matching